This note is under rapid revision and may contain misinformation or redundant content, including some extended material not covered in the summer school.
2023-07-19, Huang: "Statistical physics of Neural Network"
| Network | Developer | Year | Introduction |
| --- | --- | --- | --- |
| Multilayer Perceptron (MLP) | Frank Rosenblatt | 1958 | A simple and widely used type of neural network that consists of an input layer, one or more hidden layers, and an output layer of artificial neurons. |
| Hopfield Network | John Hopfield | 1982 | A type of recurrent neural network that can store and retrieve patterns as stable states. It consists of a single layer of fully connected neurons with symmetric weights and binary activations. |
| Recurrent Neural Network (RNN) | David Rumelhart et al. | 1986 | A type of neural network that can process sequential data such as text, speech, and time series. It consists of a hidden layer of neurons that have recurrent connections to themselves, forming a loop. |
| Convolutional Neural Network (CNN) | Yann LeCun | 1989 | A type of feedforward neural network that can process high-dimensional data such as images, videos, and speech. It consists of multiple layers of neurons that perform convolution operations on the input data, followed by pooling layers and fully connected layers. |
| Spiking Neural Network (SNN) | Eugene Izhikevich et al. | 2003 | A type of neural network that mimics the behavior of biological neurons more closely than other types. It consists of spiking neurons that communicate with each other using discrete pulses or spikes. |
This list can go on and on, along with the history of AI winters and springs. But how can we understand neural networks in a more general way?
Some well-established theories from history can still be used today.
MLP
MLP is defined as:
$$x \mapsto W_D\,\sigma_{D-1}\left(W_{D-1}\,\sigma_{D-2}\left(\dots\sigma_1\left(W_1 x\right)\right)\right)$$

where $W_i$ is the weight matrix of the $i$-th layer, $\sigma_i$ is the activation function of the $i$-th layer, and $D$ is the depth of the network.
Hopfield Network

The Hopfield network updates each neuron by thresholding its input:

$$V_i(t+1) = \begin{cases} 1 & \text{if } \sum_j T_{ij} V_j(t) > U_i \\ 0 & \text{otherwise} \end{cases}$$

where $V_i(t)$ is the state of neuron $i$ at time $t$, $T_{ij}$ is the weight of the synapse from neuron $j$ to neuron $i$, and $U_i$ is the threshold value for neuron $i$.

This definition mimics biological neurons like Fig. 1. Anything cool?
Computational ability that can generalize, categorize, correct errors, and recognize familiarity.
If we want to store a binary vector $V^s \in \{0,1\}^n$ in it, we can set the weight matrix as:

$$T_{ij} = \sum_{s=1}^{n} (2V_i^s - 1)(2V_j^s - 1)$$

Here $2V - 1 \in \{-1, 1\}$ is the binary-to-bipolar transformation.
Neurons in the same state reinforce each other; neurons in different states inhibit each other.
Here the weight matrix is fixed, unlike in the MLP.
To retrieve a stored memory from an initial state V(0), the neurons are then updated randomly and asynchronously by the Hopfield network dynamics until the network reaches a stable state.
The update rule is asynchronous and random? Why not synchronous and deterministic?
The update rule is not important, the energy function is.
We can define the energy function of the network as:
$$E = -\frac{1}{2}\sum_{i,j=1}^{N} T_{ij} V_i V_j + \sum_{i=1}^{N} U_i V_i$$
The first term is the interaction energy between neurons, and the second term is the external energy of the neurons.
It's proved that the energy function is a Lyapunov function of the Hopfield network dynamics, which means that the energy function will decrease or remain constant at each step, until it reaches a minimum value that corresponds to a stable state of the network.
Do the stable states actually correspond to the stored memories?
Not necessarily. The stable states of the Hopfield network are the local minima of the energy function, which may not correspond to the stored memories. Ising Model and Spin Glass provided the answer for different Tij.
This is the MC simulation result; local minima, i.e. noise, exist. How can we reduce them?
This paper finds that clipping the weights increases the noise.
Generalize: restore the stored memories from a corrupted version of the memories in a consistent way.
Categorize: Classify the input into one of the stored memories.
Correct errors: Correct the corrupted memories to the stored memories.
Recognize familiarity: guide the initial state to the nearest stored memory (in Hamming distance).
What cool application can be made out of this? In principle: MNIST digit recognition, image denoising, and even generative models.
Btw, Restricted Boltzmann Machine (RBM) is a special case of Hopfield network.
Hands-on session!
For coding convenience, we can define the Hopfield network as:
A Hopfield network is a graph $G=(V,T)$, where $V$ is the set of vertices and $T$ is the set of weighted edges. The state of the network is a binary vector $V(t)\in\{-1,1\}^n$, where $n$ is the number of vertices. The weight matrix $T$ is defined as:

$$T_{ij} = \sum_{s=1}^{n} V_i^s V_j^s$$

where $V_i^s$ is the $i$-th element of the $s$-th stored memory $V^s$.
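For the hands-on part, here is a minimal NumPy sketch of this $\pm 1$-convention Hopfield network (the function names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def store(patterns):
    """Hebbian weight matrix T_ij = sum_s V_i^s V_j^s, with zero diagonal."""
    V = np.asarray(patterns)               # shape (p, n), entries in {-1, +1}
    T = V.T @ V
    np.fill_diagonal(T, 0)
    return T

def energy(T, V):
    """E = -1/2 sum_ij T_ij V_i V_j (zero thresholds)."""
    return -0.5 * V @ T @ V

def retrieve(T, V, sweeps=10):
    """Asynchronous updates in random order; the energy never increases."""
    V = V.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(V)):
            V[i] = 1 if T[i] @ V >= 0 else -1
    return V

# store one pattern, then recover it from a corrupted copy
pattern = rng.choice([-1, 1], size=50)
T = store([pattern])
noisy = pattern.copy()
noisy[:5] *= -1                            # flip 5 bits
recovered = retrieve(T, noisy)
print(np.array_equal(recovered, pattern))  # → True
```

Tracking `energy` across sweeps illustrates the Lyapunov property discussed above: asynchronous updates can only lower it or leave it unchanged.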
<!-- In Progress -->
Curie-Weiss Model
The Curie-Weiss model is a mean-field model of ferromagnetism, named after Pierre Curie and Pierre Weiss. Above the Curie temperature the magnetic moments of the atoms are disordered, and the model predicts a magnetization proportional to the applied field (the Curie-Weiss susceptibility); below it, the moments spontaneously align.
Energy:
$$H = -\frac{1}{2}\sum_{i<j} J\, s_i s_j$$
This contains only 1 parameter, J.
The solution is
$$m = \tanh\left(\beta (n-1) J m\right)$$
where m is the magnetization, n is the number of particles, J is the interaction strength, and β is the inverse temperature.
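The self-consistency equation above can be solved by fixed-point iteration; a small sketch (the function name is my own):

```python
import numpy as np

def magnetization(beta, J=1.0, n=100, tol=1e-10, max_iter=10000):
    """Solve m = tanh(beta*(n-1)*J*m) by fixed-point iteration,
    starting from a positive seed to pick the m >= 0 branch."""
    m = 0.5
    for _ in range(max_iter):
        m_new = np.tanh(beta * (n - 1) * J * m)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

# the critical point is beta_c = 1/((n-1)J): disordered above T_c, ordered below
bc = 1.0 / 99.0
print(magnetization(0.5 * bc))   # ~0 (paramagnetic)
print(magnetization(2.0 * bc))   # finite (ferromagnetic)
```

Below $\beta_c$ the iteration collapses to $m=0$; above it, a nonzero solution of $m=\tanh(2m)$ appears.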
The above model only considers two body interaction, but in reality, there's higher order interaction. (Not discussed yet)
Consider Ising model with hidden variable:
$$E = \sum_{i,a} s_i \sigma_a J_{ia} + \sum_{i<j} s_i s_j J_{ij} + \sum_{a<b} \sigma_a \sigma_b J_{ab}$$
where si is the spin of the i-th particle, σa is the spin of the a-th hidden variable, Jia is the interaction strength between particle i and hidden variable a, Jij is the interaction strength between particles i and j, and Jab is the interaction strength between hidden variables a and b.
But we can only observe si, so we need to marginalize over σa:
$$p(s) = \sum_\sigma p(s,\sigma) = \sum_\sigma \frac{e^{-\beta E(s,\sigma)}}{Z}$$
The loss is
$$\mathcal{L} = \sum_{\mu=1}^{m} \log \frac{\sum_\sigma e^{-\beta E(x^\mu,\sigma)}}{Z}$$
This is the same as the above.
Similar to the above, we take the derivative with respect to $J_{ia}$; the gradient is a difference of correlations, $\langle s_i \sigma_a\rangle_{\text{data}} - \langle s_i \sigma_a\rangle_{\text{model}}$.
This is slow using MCMC. Hinton proposed a faster model: the RBM, a bipartite graph in which the hidden variables are conditionally independent of each other given the visible ones, and vice versa.
Hinton proposed Contrastive Divergence (CD) algorithm to solve this problem.
It provides a clear criterion to distinguish between the chaos phase and the memory phase in asymmetric neural networks with associative memories, based on the eigenvalue spectra of the synaptic matrices.
It reveals a novel phenomenon of eigenvalue splitting in the memory phase, and shows that the number and positions of the split eigenvalues are related to the number and stability of the memory attractors.
It develops a mean-field theory to derive analytical expressions for the eigenvalue spectra and the dynamical properties of the neural network, and verifies them with numerical simulations.
A new proposal:
find different phases and order parameters of NN. Integrate existing works.
The NN can be MLP, Hopfield, RNN, CNN, SNN, etc.
One work relates to the grokking behavior in NNs, but it does not offer any meaningful insight into the phase diagram.
This so-called phase diagram/phase transition is an abuse of terms from physics, with limited theoretical insight. For more discussion, see the OpenReview page of this paper.
A reviewer claims that there have been a couple of papers that use tools from physics and provide phase diagrams for understanding the generalization of neural networks:
Generalisation error in learning with random features and the hidden manifold model by Gerace et al
Multi-scale Feature Learning Dynamics: Insights for Double Descent by Pezeshki et al
The Gaussian equivalence of generative models for learning with two-layer neural networks by Goldt et al
The model assigns $p(x) = \frac{e^{-E(x)}}{Z}$, where $Z = \sum_x e^{-E(x)}$ is the normalization factor.
We can learn $E(x)$ from the data by minimizing

$$L = -\frac{1}{|D|}\sum_{x\in D} \ln p(x)$$
For example, if we set $E(x) = -\frac{1}{2} W_{ij} x_i x_j$ (summation over repeated indices implied), then all we need is the gradient of $L$ w.r.t. $W_{ij}$:

$$\frac{\partial L}{\partial W_{ij}} = -\frac{1}{2}\left(\langle x_i x_j\rangle_{x\sim D} - \langle x_i x_j\rangle_{x\sim p}\right)$$
and we can use gradient descent to minimize it
$$W'_{ij} = W_{ij} + \eta\left(\langle x_i x_j\rangle_{x\sim D} - \langle x_i x_j\rangle_{x\sim p}\right)$$
What makes the algorithm costly is the generation of $x\sim p$, which must be re-generated at every step using MCMC. To deal with this, one may use samples from $D$, for example the last batch, to replace samples from $p$.
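A toy sketch of this moment-matching update. To avoid MCMC entirely, the model average is computed exactly by enumerating all $2^n$ states, which is only feasible for a handful of spins; all names and the synthetic data are my own:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 5
states = np.array(list(itertools.product([-1, 1], repeat=n)))   # all 2^n states

def model_moments(W):
    """Exact <x_i x_j> under p(x) ∝ exp(-E(x)) with E(x) = -1/2 x^T W x."""
    E = -0.5 * np.einsum('si,ij,sj->s', states, W, states)
    p = np.exp(-E)
    p /= p.sum()
    return np.einsum('s,si,sj->ij', p, states, states)

# synthetic data: noisy copies of one pattern (15% of spins flipped)
pattern = rng.choice([-1, 1], size=n)
data = np.where(rng.random((200, n)) < 0.15, -pattern, pattern)
data_moments = data.T @ data / len(data)

W = np.zeros((n, n))
for _ in range(500):
    W += 0.05 * (data_moments - model_moments(W))   # the update rule above
    np.fill_diagonal(W, 0)

err = np.abs(data_moments - model_moments(W)).max()
print(err)   # small: the model moments now track the data moments
```

Starting from $W=0$ the model correlations are zero; the update drives them toward the empirical ones, exactly as the gradient formula above prescribes.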
In practice, when we train the energy function, we often treat it in an adversarial form. A standard choice, matching the symbols below, is the GAN objective:

$$\min_G \max_D \; \mathbb{E}_{x\sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z\sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

where $G$ is the generator, $D$ is the discriminator, $p_{\text{data}}(x)$ is the data distribution, $p_z(z)$ is the noise distribution, and $G(z)$ is the generated data.
Contrastive Learning
Contrastive learning is a general framework for unsupervised learning. The idea is to learn a representation that makes the data distribution and the noise distribution distinguishable.
Score-Based Diffusion Model
Jacot et al., 2018
Statistical Physics of Learning
Glassy state
Magnetic impurities in a non-magnetic metal: in AuFe, they make the directions of the spin magnetic moments random, giving long-range disorder.
Glass state is a non-equilibrium, non-crystalline condensed state of matter that exhibits a glass transition when heated towards the liquid state.
The replica method is a technique to deal with quenched disorder in statistical physics, such as in spin glass models. The idea is to compute the average of the logarithm of the partition function by introducing replicas of the system and taking the limit of zero replicas. The disorder-averaged partition function of $n$ replicas can be written as:

$$\overline{Z^n} = \int \prod_{i<j} dJ_{ij}\, P(J_{ij}) \sum_{\{\sigma_i^a\}} \exp\left[\beta\sum_{a=1}^{n}\left(\sum_{i<j} J_{ij}\,\sigma_i^a \sigma_j^a + h\sum_i \sigma_i^a\right)\right]$$

where $\sigma_i^a$ is the spin variable of the $i$-th site and the $a$-th replica, $J_{ij}$ is the random coupling between sites $i$ and $j$, $P(J_{ij})$ is the probability distribution of the couplings, $\beta$ is the inverse temperature, and $h$ is the external magnetic field.
The replica method assumes that there is a unique analytic function that interpolates the values of Zn for integer n, and that this function can be analytically continued to real values of n. Then, one can write:
$$\lim_{n\to 0}\frac{Z^n - 1}{n} = \log Z$$
The average free energy per spin can then be obtained by:
$$f = -\lim_{n\to 0}\frac{1}{\beta n N}\log \overline{Z^n}$$
where N is the number of spins in the system.
The main difficulty in applying the replica method is to find a suitable representation for the order parameter that describes the correlations between replicas. This order parameter is usually an n×n symmetric matrix Qab, where Qab is the overlap between replicas a and b, defined as:
$$Q_{ab} = \frac{1}{N}\sum_i \sigma_i^a \sigma_i^b$$
The matrix Qab encodes the structure of the phase space of the system, and how it is partitioned into different states or clusters.
The simplest assumption for the matrix Qab is that it is invariant under permutations of replicas, meaning that all replicas are equivalent. This is called the replica symmetric (RS) Ansatz, and it implies that:
$$Q_{aa} = 0, \qquad Q_{ab} = q \quad (a\neq b)$$

where $q$ is a constant that measures the average overlap between replicas.
However, it turns out that the RS Ansatz is not valid for some systems, such as the Sherrington-Kirkpatrick (SK) model of spin glasses. In these systems, there are many metastable states that are separated by large energy barriers, and different replicas can explore different regions of phase space. This leads to a more complicated structure for the matrix Qab, which requires breaking replica symmetry.
Replica symmetry breaking (RSB) is a way to generalize the RS Ansatz by allowing for different values of overlaps between replicas, depending on how they are grouped or clustered. The most general form of RSB is called full RSB, and it involves an infinite hierarchy of breaking levels. However, for some systems, such as the SK model, a simpler form of RSB is sufficient, called one-step RSB (1-RSB).
In 1-RSB, one divides the replicas into n/m groups of size m, where m is a real parameter between 0 and 1. Then, one assumes that:
$$Q_{aa} = 0, \qquad Q_{ab} = \begin{cases} q_1 & a, b \text{ in the same group} \\ q_0 & a, b \text{ in different groups} \end{cases}$$

where $q_1 > q_0$ are constants that measure the intra-group and inter-group overlaps, respectively.
The 1-RSB Ansatz captures the idea that there are clusters of states that have a higher overlap within them than between them. The parameter m controls how probable it is to find two replicas in the same cluster.
Using the 1-RSB Ansatz, one can compute the free energy per spin as:
where $Dz = \frac{dz}{\sqrt{2\pi q_0}}\exp\left(-\frac{z^2}{2 q_0}\right)$ is a Gaussian measure.
The self-consistency equations for q1, q0, and m can be obtained by extremizing the free energy with respect to these parameters. The solution of these equations gives the correct description of the low-temperature phase of the SK model, as proven by Parisi [6].
A general procedure of the replica method is as follows:
The crazy part of the replica method is that we want to calculate $\log Z$, but end up evaluating a different quantity, and the two magically equal each other! (Rigorously proven only for the SK model; no proof yet for other cases.)
Cavity Method and 2-layer Perceptron
Cavity Method
The cavity method is a technique to compute the average of a function of a random variable in a large system, by using a recursive equation that describes the correlations between the variable and its neighbors. It is used in statistical physics to compute the free energy of spin glass models, and in computer science to analyze the performance of message passing algorithms on random graphs.
Large deviation theory
ML algorithms don't stop at the saddle points that exist independently, but stay in the large (although seemingly rare) basins of attraction.
"Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses."
Quantum Many-body Physics Introduction
Quantum Many-body Physics
Quantum many-body physics is the study of the behavior of systems made of many interacting particles, where quantum mechanics plays an essential role. It is used to describe a wide range of physical phenomena, from the behavior of electrons in metals and semiconductors, to the superfluidity of liquid helium, the Bose-Einstein condensation of ultracold atoms, and the superconductivity of certain materials at low temperatures.
Quantum Many-body Problem
The quantum many-body problem is the problem of finding the ground state of a quantum system with many interacting particles. It is one of the most challenging problems in physics, due to the exponential complexity of the Hilbert space of the system. The difficulty of the problem depends on the dimensionality of the system, the type of interactions, and the symmetries of the Hamiltonian.
Define
$$H = \sum_{i=1}^{N}\frac{p_i^2}{2m} + \sum_{i<j} V(r_i - r_j)$$
where ri is the position of the i-th particle, pi is its momentum, m is its mass, and V(ri−rj) is the interaction potential between particles i and j.
The ground state of the system is the state with the lowest energy, and it is the state that the system will tend to occupy at zero temperature.
Quantum Many-body Problem in 1D
In one dimension, the quantum many-body problem can be solved exactly using the Bethe ansatz. The Bethe ansatz is a method to find the eigenstates of a system with integrable Hamiltonian, by writing them as a linear combination of plane waves. It was first introduced by Hans Bethe in 1931 to solve the Heisenberg model of ferromagnetism, and it was later generalized to other models such as the Hubbard model of interacting electrons.
Quantum Many-body System introduction
A quantum many-body system is a system with many interacting particles, where quantum mechanics plays an essential role.
Iron-based superconductors
Manganese oxides
competing order: different order parameters compete with each other.
Exponential complexity of the Hilbert space of the system: $2^N$, where $N \sim 10^{23}$.
Superconductivity
Superconductivity can be characterized by:

- Meissner effect: perfect diamagnetism
- Zero resistance: perfect conductivity
- Cooper pair: bound state of two electrons
- BCS theory: Cooper pair condensation
During the writing of this note, LK-99, a material claimed to be superconducting, is attracting public attention.
Can theory really guide the discovery of novel materials? The answer seems to be NO so far.
Experiment facts:
Tc is the critical temperature of the superconducting phase transition.
Specific heat jump at Tc.
Isotope effect: $T_c \propto M^{-\alpha}$, where $M$ is the mass of the atom and $\alpha \sim 0.5$.
Theory explaining the isotope effect: BCS theory.
BCS theory basics:
Cooper pair: bound state of two electrons.
Cooper pair condensation: the ground state of the system is a condensate of Cooper pairs.
BCS wavefunction (ground state): $|\Psi\rangle = \prod_{k>0}\left(u_k + v_k c_k^\dagger c_{-k}^\dagger\right)|0\rangle$, where $c_k^\dagger$ is the creation operator of an electron with momentum $k$, and $u_k$ and $v_k$ are the coefficients of the wavefunction.
BCS Hamiltonian: $H = \sum_k \xi_k c_k^\dagger c_k + \sum_{k,k'} V_{kk'}\, c_k^\dagger c_{-k}^\dagger c_{-k'} c_{k'}$, where $\xi_k$ is the kinetic energy of an electron with momentum $k$, and $V_{kk'}$ is the interaction potential between electrons with momenta $k$ and $k'$.
BCS gap equation: $\Delta_k = \sum_{k'} V_{kk'} \frac{\Delta_{k'}}{2E_{k'}} \tanh\left(\frac{\beta E_{k'}}{2}\right)$, where $\Delta_k$ is the gap parameter, $E_k = \sqrt{\xi_k^2 + \Delta_k^2}$ is the quasiparticle energy, and $\beta = 1/k_B T$ is the inverse temperature.
The critical temperature is given by $k_B T_c = 1.14\,\omega_D\, e^{-1/\lambda}$, where $\omega_D$ is the Debye frequency and $\lambda$ is the coupling constant.
It explains the isotope effect as follows:
The mass of the atom affects the interaction potential between electrons.
The interaction potential affects the gap parameter.
The gap parameter affects the critical temperature.
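The critical-temperature formula above can be checked numerically. This is a sketch under the standard constant-coupling (weak-coupling) approximation, in units where $\hbar = k_B = \omega_D = 1$; the function names are my own:

```python
import numpy as np

def gap_rhs(T, lam, wD=1.0, num=20001):
    """RHS of the linearized gap condition at T_c (constant coupling):
       1 = lam * integral_0^wD dxi tanh(xi/(2T)) / xi."""
    xi = np.linspace(1e-9, wD, num)
    f = np.tanh(xi / (2 * T)) / xi
    return lam * np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(xi))  # trapezoid rule

def critical_temperature(lam, lo=1e-4, hi=1.0, iters=60):
    """Bisection on T: gap_rhs decreases with T and equals 1 exactly at T_c."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap_rhs(mid, lam) > 1.0:
            lo = mid          # still below T_c
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam = 0.25
Tc = critical_temperature(lam)
print(Tc * np.exp(1 / lam))   # ≈ 1.13, the weak-coupling BCS prefactor
```

The recovered prefactor $2e^\gamma/\pi \approx 1.13$ matches the $k_B T_c = 1.14\,\omega_D e^{-1/\lambda}$ formula quoted above.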
BCS Theory
Three key points:
Electrons are attracted to each other due to electron-phonon interaction.
Electrons form Cooper pairs.
Cooper pairs condense into a superconducting state.
Cooper Pair
Definition: A Cooper pair is a pair of electrons with opposite spins and momenta $\pm\hbar k_F$, where $k_F$ is the Fermi wave vector, that are bound together by an attractive interaction mediated by phonons.
Definition: The binding energy of a Cooper pair is the difference between the energy of two free electrons and the energy of a bound pair.
Theorem: Under the assumptions of a weak electron-phonon coupling and a screened Coulomb interaction, the binding energy of a Cooper pair is given by:
$$E_b = -g^2(k_F)\, n^2(\omega_D)\, V^{-1}$$

where $g(k_F)$ is the coupling constant at zero momentum transfer, $n(\omega_D)$ is the Bose-Einstein distribution function for phonons at the Debye frequency $\omega_D$, and $V$ is the volume of the system.
Superfluidity
PWAnderson: Imagine yourself in a room full of people, each of whom is wearing a pair of roller skates. You are not wearing roller skates. You want to move across the room, but you cannot walk. What do you do? You grab onto the hand of the person nearest you, and the two of you move together across the room. Then you let go, grab onto the hand of the next person, and so on. In this way, you can move across the room without ever putting on a pair of roller skates. This is how electrons move through a superconductor.
Consider the Bose-Hubbard model:

$$H = -t\sum_{\langle i,j\rangle}\left(b_i^\dagger b_j + \mathrm{h.c.}\right) + \frac{U}{2}\sum_i n_i\left(n_i - 1\right) - \mu\sum_i n_i$$

where $b_i^\dagger$ is the creation operator of a boson at site $i$, $n_i = b_i^\dagger b_i$ is the number operator at site $i$, $t$ is the hopping amplitude, $U$ is the on-site interaction strength, and $\mu$ is the chemical potential.
The ground state of the system can be either a superfluid or a Mott insulator, depending on the values of the parameters t, U, and μ.
The superfluid phase is characterized by the presence of long-range phase coherence, which means that the phase of the wavefunction is the same at all sites. This allows the bosons to move freely through the lattice, and it leads to a finite superfluid density.
The Mott insulator phase is characterized by the absence of long-range phase coherence, which means that the phase of the wavefunction is different at different sites. This prevents the bosons from moving freely through the lattice, and it leads to a zero superfluid density.
Solve for the ground state of the system using the Gutzwiller ansatz, which is a variational ansatz that assumes that the ground state is a product state of the form:
$$|\Psi\rangle = \prod_i |\psi_i\rangle$$
where ∣ψi⟩ is the state of site i.
The Gutzwiller ansatz is exact for the Mott insulator phase, and it gives a good approximation for the superfluid phase.
Spin Glass: Theory and applications
Setting
$N$ elements, each with property $\sigma_i$ and energy $E_i(\sigma_i)$.
The elements interact with each other.
Connections between ML and SP
2023-07-23 09:11:11 Pan Zhang, SP, ML, QC
ML is "essentially" fitting the joint probability distribution of the data.
Simulation methods in one field can be used in the other fields.
Does nature essentially compute?
Is the universe a computer?
Computation is defined as the process of transforming information from one form to another. It is a fundamental process that occurs in nature, and it is the basis of all physical phenomena.
Can we mimic nature's computation?
Can we harness nature's computation?
Why are there different phases and phase transitions?
Free energy: Consider all possible configurations.
At low T, the liquid phase is still possible, but with vanishing probability.
Nature chooses the phase with the lowest free energy.
Will nature fail?
Super-cooled liquid: a liquid that persists below the freezing point even though the solid has lower free energy.
The system may rest at some local minimum.
So sometimes there is no solid, but a super-cooled liquid or a glass. (This sounds insane, but is reasonable; it resolved a long-standing confusion of mine.)
More is different: exponential complexity space, but only a few phases.
Statistical mechanics from theoretical computer science perspective: computational complexity.
Statistical Physics and Statistical Inference is the same thing
Given a data set $D=\{x\}$, where each sample $x\in D$ has a label $y$: if we can learn the joint distribution $p(x,y)$ from the dataset and generate unseen samples conditioned on the label, then the model is called a Generative Model.
Learn: given data x, label y, learn p(x,y)
Usage:
discriminative task: $p(y|x) = p(x,y)/p(x)$
generative task: $p(x|y) = p(x,y)/p(y)$
This seems to be a trivial Bayesian estimation, but problems arise when we deal with a high-dimensional distribution (i.e., $\dim x \gg 1$), since we would need to fit a high-dimensional curve (the curse of dimensionality).
To deal with this, we introduce some models that give a prior distribution function p(x) and learn the parameters to obtain the correct distribution.
This may seem a trivial practice, since it makes no difference from simply using NNs to represent $p(x,y)$.
The loss function is the difference between p(x) and sample distribution π(x), we minimize
$$\mathrm{KL}(\pi\,\|\,p) = \sum_{x\in D}\pi(x)\ln\left[\frac{\pi(x)}{p(x)}\right] = \left\langle \ln\frac{\pi}{p}\right\rangle_{\pi}$$
we may simplify it as
$$\mathrm{KL}(\pi\,\|\,p) = \left\langle \ln\frac{\pi}{p}\right\rangle_{\pi} = \langle\ln\pi\rangle_{\pi} - \langle\ln p\rangle_{\pi}$$
In most cases, we treat $\pi(x) = \frac{1}{|D|}\sum_{x'\in D}\delta(x - x')$, and all we need is to minimize $-\langle \ln p\rangle_\pi$, which can be simplified as
$$L = -\langle \ln p\rangle_{\pi} = -\frac{1}{|D|}\sum_{x\in D}\ln p(x)$$
This is the Negative Log-Likelihood (NLL); minimizing it maximizes the likelihood.
Another way to understand:
Given data x, we want to learn the distribution p(x).
Define it as a partition function:
$$Z = p(x) = \sum_s p(x|s)\, p_0(s)$$
where s is the hidden variable, p0(s) is the prior distribution of s, and p(x∣s) is the conditional distribution of x given s.
Given $p(x|s)$ and $p_0(s)$, we can compute the posterior $p(s|x)$, and then $p(x)$:

$$p(s|x) = \frac{p(x,s)}{p(x)} = \frac{p(x|s)\,p_0(s)}{p(x)}$$
Information Compression
Given $y = f(x)$, we want to restore $x$ from $y$, where $f(x) = Fx$ and $F$ is an $m\times n$ matrix with $m < n$.
In principle, we can't restore $x$ from $y$, but we can find an $x'$ that is close to $x$. And if $x$ is sparse, given enough measurements $y$, we can restore $x$.
GPU is all you need: the current boom in ML is due to GPUs. NNs are not the best models, but the most suitable ones for GPUs.
Success of ML = Data + Algorithms + NNs + GPU
Given that computing hardware is so important, what are the current solutions to the failure of Moore's law?
Quantum Computing
"Nature isn't classical, dammit, and if you want to make a simulation of nature, you'd better make it quantum mechanical, and by golly it's a wonderful problem, because it doesn't look so easy." - Richard Feynman
Neuromorphic Computing
"The brain is a computer made of meat." - Marvin Minsky
Optical Computing
"The future of computing is light." - John Hennessy
DNA Computing
"DNA is like a computer program but far, far more advanced than any software ever created." - Bill Gates
The above are analog computing, which have inherent problems of noise and precision. But the future of AGI might be analog computing.
Quantum Supremacy
Quantum Supremacy is the demonstration of a quantum computer performing a calculation that is beyond the reach of the most powerful supercomputers today.
The quantum state in the random circuit sampling problem is a superposition of all possible states, and it is given by:
$$|\psi\rangle = \frac{1}{\sqrt{2^n}}\sum_{x\in\{0,1\}^n} |x\rangle$$
where n is the number of qubits, and ∣x⟩ is the computational basis state with binary representation x.
We apply random single-qubit gates to the initial state, and we measure the final state in the computational basis. The probability of obtaining a given outcome x is given by:
$$p(x) = \left|\langle x|U|\psi\rangle\right|^2$$
where U is the unitary operator that represents the random circuit.
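A toy state-vector check of this probability formula (the helper functions are my own; a real supremacy circuit would also interleave two-qubit gates):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3                                  # qubits; the state vector has 2^n amplitudes

def haar_single_qubit():
    """Haar-random 2x2 unitary via QR decomposition of a complex Gaussian."""
    z = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))   # fix the phase convention

def apply_1q(state, gate, target):
    """Apply a single-qubit gate to qubit `target` of an n-qubit state vector."""
    psi = state.reshape([2] * n)
    psi = np.moveaxis(psi, target, 0)
    psi = np.tensordot(gate, psi, axes=([1], [0]))
    return np.moveaxis(psi, 0, target).reshape(-1)

# uniform superposition |psi> = 2^{-n/2} sum_x |x>, then a layer of random gates
psi = np.full(2 ** n, 2 ** (-n / 2), dtype=complex)
for q in range(n):
    psi = apply_1q(psi, haar_single_qubit(), q)

p = np.abs(psi) ** 2                   # p(x) = |<x|U|psi>|^2 for each bitstring x
print(p.sum())                         # → 1.0 up to float error (U is unitary)
```

Since $U$ is unitary, the output probabilities over all $2^n$ bitstrings sum to one, which the final line verifies.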
In the paper "Quantum supremacy using a programmable superconducting processor", the circuit has 53 qubits, so the classical computational cost can be estimated as of order $2^{53}$ in storage and computation.
Tensor networks: solve the problem on classical hardware.
Statistical Physics of Learning
How can the Ising model be used for learning?
The Ising model can be defined as:

$$H = -\sum_{i<j} J_{ij}\, s_i s_j$$
where si is the spin of the i-th particle, and Jij is the interaction strength between particles i and j.
The energy function must be computable in polynomial time; otherwise, this model won't work. (But why?)
It satisfies the Boltzmann distribution:

$$p(s) = \frac{e^{-\beta H(s)}}{Z}$$
where β is the inverse temperature, and Z is the partition function.
Why the Boltzmann distribution?
Consider m microstates, with n spins.
We want to maximize the entropy:
$$S = -\sum_s p(s)\ln p(s)$$
subject to the constraints:
$$\langle s_i \rangle = \frac{1}{m}\sum_s s_i\, p(s), \qquad \langle s_i s_j \rangle = \frac{1}{m}\sum_s s_i s_j\, p(s)$$

which means these two moments equal the sample averages.
Implicit vs Explicit Generative Models
Given a parameterized distribution pθ(x), if we can explicitly compute the probability of a given sample x, then it is called an explicit generative model. Otherwise, it is called an implicit generative model.
Is RBM explicit or implicit?
Since $p_\theta(x) = \frac{e^{-\beta E(x)}}{Z}$ and the partition function $Z$ is intractable, we cannot explicitly compute the probability of a given sample $x$, so it is an implicit generative model.
Is Flow model explicit or implicit?
Since $p_\theta(x) = p_0(z)\prod_{i=1}^{n}\left|\det\frac{\partial f_i}{\partial z_i}\right|^{-1}$ by the change-of-variables formula, we can explicitly compute the probability of a given sample $x$, so it is an explicit generative model.
Is VAE explicit or implicit?
Since $p_\theta(x) = \int dz\, p_\theta(x|z)\, p_0(z)$, we cannot explicitly compute the probability of a given sample $x$, so it is an implicit generative model.
Is Autoregressive model explicit or implicit?
Since $p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i\,|\,x_{<i})$, we can explicitly compute the probability of a given sample $x$, so it is an explicit generative model.
GPT uses autoregressive model, so it is an explicit generative model.
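A toy illustration of the autoregressive factorization, with an assumed two-state transition table standing in for the learned conditionals:

```python
import numpy as np

# a toy autoregressive model over binary sequences:
# p(x) = p(x_1) * prod_i p(x_i | x_{i-1}), with an assumed transition table
p_first = np.array([0.5, 0.5])                 # p(x_1)
p_trans = np.array([[0.9, 0.1],                # p(x_i | x_{i-1} = 0)
                    [0.2, 0.8]])               # p(x_i | x_{i-1} = 1)

def log_p(x):
    """Exact log-likelihood, accumulated factor by factor: the 'explicit' part."""
    lp = np.log(p_first[x[0]])
    for prev, cur in zip(x[:-1], x[1:]):
        lp += np.log(p_trans[prev, cur])
    return lp

print(np.exp(log_p([0, 0, 0])))   # → 0.5 * 0.9 * 0.9 = 0.405
```

Because every factor is normalized, the probabilities of all sequences of a given length sum to one, so the likelihood of any sample is exactly computable, which is what makes the model explicit.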
Should the parameter be shared within the model?
Hong Zhao, XMU, 2023-07-24 09:00:21
"Statistical Physics is not of much use in Machine Learning, Statistical Physics maximize the entropy, but ML minimize it."
Serious?
Statistical Physics or Statistics?
Main topics in Statistical Physics:
Boltzmann distribution
Maxwell-Boltzmann distribution
Darwin-Fowler distribution
Gibbs distribution
Statistical Physics is yet to be fully developed compared to classical mechanics.
No clear boundaries of application between these theories
No physics theory can describe most of the world we actually see. That's why ML rises.
Flowing water, clouds in the sky, the blowing wind: everything around us is in a state far from equilibrium.
Statistics have the following topics:
Random variable
Probability distribution
Formal Definition of Statistical Physics
Given data $x = (x_1, x_2, \dots, x_n, p_1, p_2, \dots, p_n)$ in the phase space $\Gamma$, where $x_i$ is the coordinate of the $i$-th particle, $p_i$ is the momentum of the $i$-th particle, and $n$ is the number of particles.
(Micro-canonical ensemble) The probability of a given microstate x is given by:
$$p(x) = \frac{1}{\Omega(E)}\,\delta\left(E - E(x)\right)$$
where Ω(E) is the number of microstates with energy E, and δ(E−E(x)) is the Dirac delta function.
(Canonical ensemble) The probability of a given microstate x is given by:
$$p(x) = \frac{1}{Z}\, e^{-\beta E(x)}$$
where Z is the partition function, β is the inverse temperature, and E(x) is the energy of the microstate x.
(Grand canonical ensemble) The probability of a given microstate x is given by:

$$p(x) = \frac{1}{\Xi}\, e^{-\beta E(x)}\, e^{-\alpha N(x)}$$
where Ξ is the grand partition function, β is the inverse temperature, α is the chemical potential, E(x) is the energy of the microstate x, and N(x) is the number of particles in the microstate x.
(Bose-Einstein distribution) The probability of a given microstate x is given by:
$$p(x) = \frac{1}{Z}\prod_{i=1}^{n}\frac{1}{e^{\beta(E_i - \mu)} - 1}$$
where Z is the partition function, β is the inverse temperature, μ is the chemical potential, Ei is the energy of the i-th particle, and n is the number of particles.
(Fermi-Dirac distribution) The probability of a given microstate x is given by:
$$p(x) = \frac{1}{Z}\prod_{i=1}^{n}\frac{1}{e^{\beta(E_i - \mu)} + 1}$$
where Z is the partition function, β is the inverse temperature, μ is the chemical potential, Ei is the energy of the i-th particle, and n is the number of particles.
Now we have a distribution over $\Gamma$, but what we see is an average over time. How to resolve this?
Ergodicity: time average = ensemble average
Ergodicity hypothesis: the time average of a physical quantity is equal to its ensemble average.
$$\lim_{t\to\infty}\frac{1}{t}\int_0^t A(x(t'))\,dt' = \langle A\rangle$$

where $A(x(t'))$ is the value of the physical quantity $A$ at time $t'$, and $\langle A\rangle$ is the ensemble average of $A$.
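A quick single-spin illustration of this hypothesis, assuming Metropolis dynamics (my own sketch): the time average along one trajectory should converge to the ensemble average $\langle s\rangle = \tanh(\beta h)$.

```python
import numpy as np

rng = np.random.default_rng(3)
beta, h = 1.0, 0.5                 # single spin s = ±1 with energy E(s) = -h*s

s, total, steps = 1, 0.0, 200_000
for _ in range(steps):
    dE = 2 * h * s                 # energy change for the flip s -> -s
    if rng.random() < np.exp(-beta * dE):   # Metropolis acceptance rule
        s = -s
    total += s

time_avg = total / steps
print(time_avg, np.tanh(beta * h))  # the two values should be close
```

For this tiny ergodic system the two averages agree; the subtleties listed next arise for deterministic Hamiltonian dynamics of large systems.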
Problem related to this hypothesis:
Poincare recurrence theorem: the system will eventually return to a state arbitrarily close to its initial state.
Loschmidt's paradox: the time-reversed dynamics of a system is not the same as the original dynamics of the system.
Boltzmann's H-theorem: the entropy of an isolated system will increase over time.
Difference from statistics:
Data generated by Hamiltonian dynamics.
A prior distribution is given by the Hamiltonian.
Do not assume ergodicity: Boltzmann invented the $\Gamma$ space but didn't know how to proceed. The "Boltzmann School" (which throws away the ergodicity hypothesis) instead assumes dynamical mixing. (No proof yet.)
Works strictly only when $N\to\infty$.
Can only be applied to small systems.
Why does statistical physics have no mechanics in it? (i.e., no dynamical equation)
Liouville's theorem: the phase space volume is conserved under Hamiltonian dynamics.
But it cannot provide proof for ergodicity or equa-partition theorem.
Consider Noisy-Dynamics?
"Random School" developed by Einstein, Smoluchowski, Langevin, Fokker-Planck, Kolmogorov, etc.
Boltzmann equation:
$$\frac{\partial f}{\partial t}+v\cdot\nabla f=C[f]$$
where f is the probability distribution function, v is the velocity, and C[f] is the collision operator.
Learning Machine
If the output can feed back into the system, then it is a dynamical system.
It provides a clear criterion to distinguish between the chaos phase and the memory phase in asymmetric neural networks with associative memories, based on the eigenvalue spectra of the synaptic matrices.
It reveals a novel phenomenon of eigenvalue splitting in the memory phase, and shows that the number and positions of the split eigenvalues are related to the number and stability of the memory attractors.
It develops a mean-field theory to derive analytical expressions for the eigenvalue spectra and the dynamical properties of the neural network, and verifies them with numerical simulations.
Other Model-free methods
Phase Space Reconstruction
Reservoir Computing Approach
Long Short-Term Memory (LSTM)
Time Delay Dynamical Learning
Which one is the best?
No comparison is found in the literature.
Dynamic Systems: A Prior
Lorenz system:
$$\frac{dx}{dt}=\sigma(y-x),\qquad \frac{dy}{dt}=x(\rho-z)-y,\qquad \frac{dz}{dt}=xy-\beta z$$
where x, y, and z are the state variables, σ, ρ, and β are the parameters, and t is the time.
It exhibits chaotic behavior for σ=10, ρ=28, and β=8/3.
Phase space reconstruction:
$$\mathbf{x}_i=(x_i,\,x_{i-\tau},\,x_{i-2\tau},\,\ldots,\,x_{i-(m-1)\tau})$$
where x_i is the i-th data point, τ is the time delay, and m is the embedding dimension.
Benchmarks are the key to a solid foundation.
Does phase space reconstruction apply to problems with only data and no model?
Yes, because it is a model-free method; it can predict the future of the system.
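As a concrete illustration of the delay embedding above, a minimal sketch (the sine series and the values of m and τ are made-up examples):

```python
import numpy as np

def delay_embed(x, m, tau):
    # build delay vectors (x_i, x_{i-tau}, ..., x_{i-(m-1)tau}) for each valid i
    x = np.asarray(x)
    n = len(x) - (m - 1) * tau
    return np.stack(
        [x[(m - 1 - j) * tau : (m - 1 - j) * tau + n] for j in range(m)], axis=1
    )

t = np.linspace(0, 20 * np.pi, 2000)   # hypothetical scalar time series
series = np.sin(t)
X = delay_embed(series, m=3, tau=10)   # shape (1980, 3)
```

Each row of `X` is one reconstructed phase-space point; by Takens' embedding idea, these vectors can unfold the attractor of the underlying system from a single observable.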
Reservoir Computing
Reservoir computing is a machine learning framework for training recurrent neural networks. It was introduced by Herbert Jaeger in 2001 as the echo state network; a closely related approach is the liquid state machine of Wolfgang Maass.
Theory:
$$r(t)=\tanh\left(W_{\text{in}}\,u(t)+W\,r(t-1)\right)$$
where r(t) is the state vector at time t, W_in is the input weight matrix, u(t) is the input vector at time t, W is the recurrent weight matrix, and tanh is the hyperbolic tangent function.
Training procedure:
Initialize Win and W randomly.
For each training sample u(t), compute r(t) using the above equation.
Compute W_out using ridge regression: $W_{\text{out}}=YR^T(RR^T+\lambda I)^{-1}$, where Y is the target output matrix, R is the state matrix, λ is the regularization parameter, and I is the identity matrix.
The target output matrix Y is not computed from the network — it is given by the training data (the desired outputs at each time step).
In standard reservoir computing, W_in and W stay fixed after their random initialization; only W_out is trained, so the ridge-regression step above already completes the training in one shot and no backpropagation through time is needed.
The training error that ridge regression minimizes (up to the regularization term) is
$$E=\frac{1}{2}\sum_{t=1}^{T}\lVert y(t)-Y(t)\rVert^2$$
where T is the number of time steps, y(t) is the output vector at time t, and Y(t) is the target output vector at time t.
Use Win, W, and Wout to predict the output.
What is ridge regression?
Ridge regression is linear regression with an L2 penalty on the weights, $\min_W \lVert Y-WR\rVert^2+\lambda\lVert W\rVert^2$; the penalty prevents overfitting.
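Putting the pieces together, a minimal echo state network trained only by ridge regression. The sizes, scalings, and one-step sine-prediction task are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_res, T = 1, 50, 500
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius below 1

u = np.sin(0.1 * np.arange(T))[:, None]           # input sequence
r = np.zeros(n_res)
R = np.zeros((n_res, T))
for t in range(T):
    r = np.tanh(W_in @ u[t] + W @ r)              # r(t) = tanh(W_in u(t) + W r(t-1))
    R[:, t] = r

Y = np.roll(u, -1, axis=0).T                      # target: predict u(t+1)
lam = 1e-6
W_out = Y @ R.T @ np.linalg.inv(R @ R.T + lam * np.eye(n_res))  # ridge regression
pred = W_out @ R
```

Only `W_out` is fitted; the reservoir weights stay random, which is what makes training a single linear solve rather than backpropagation.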
Systematic Prediction based on Periodic Orbits
Monte Carlo Methods
Jiao Wang, XMU, 2023-07-24 13:01:59
Quantum Annealing
Formulation of quantum annealing
Quantum annealing is a metaheuristic for finding the global minimum of a given objective function over a given set of candidate solutions (candidate states), by a process using quantum fluctuations. Quantum annealing is used mainly for problems where the search space is discrete (combinatorial optimization problems) with many local minima; such as finding the ground state of a spin glass.
Transverse field Ising model:
$$H=-\frac{A(s)}{2}\sum_{i=1}^{n}\sigma_i^x-\frac{B(s)}{2}\sum_{i<j}J_{ij}\,\sigma_i^z\sigma_j^z$$
where A(s) is the transverse field, B(s) is the longitudinal field, σix is the Pauli-X operator on the i-th qubit, σiz is the Pauli-Z operator on the i-th qubit, Jij is the interaction strength between the i-th and j-th qubits, and n is the number of qubits.
The annealing parameter s is swept from 0 to 1, so H(0) is the Hamiltonian at time 0 (pure transverse field) and H(1) is the Hamiltonian at time 1 (the problem Hamiltonian).
An example problem: phase transition of the Ising model.
Density of defects ρ vs. the annealing time t_s:
how do we obtain this relationship ρ(t_s)?
Kibble-Zurek mechanism:
The Kibble-Zurek mechanism (KZM) describes the non-equilibrium dynamics of a system undergoing a continuous phase transition. It was proposed by T. W. B. Kibble in 1976 and extended by Wojciech Zurek in 1985. The KZM is based on the idea that when a system is driven through a continuous phase transition at a finite rate, it cannot remain in equilibrium near the critical point: the relaxation time diverges and the dynamics effectively freeze. This allows one to estimate the density of topological defects in the system as a function of the rate of the transition.
$$\rho(t_s)\sim\left(\frac{t_Q}{t_s}\right)^{1/2}$$
where t_Q is the quantum critical time scale.
The theoretical prediction is
$$\log\rho=\log\left(\frac{1}{2\pi}\sqrt{\frac{h}{2J}}\right)-\frac{1}{2}\log t_s$$
where h is the transverse field, and J is the interaction strength.
The theory has no fitted parameters, yet it aligns well with the experimental data.
This is the first case showing that quantum annealing, using 5300 qubits, can address a problem that is not efficiently solvable by classical computers.
Q: Include the long range interaction, will it be better?
A: Maybe; no verification yet.
Q: They reached L=2000 noise-free qubits, but only under a 50 ns annealing time. (Environmental noise is the main problem.)
Wikipedia is doing a great job in democratizing knowledge.
Dynamics of cell state transition
Jianhua Xing, University of Pittsburgh
Mathematical considerations:
$$\frac{dz}{dt}=A(z)+\xi(z,t)$$
Why does it take this form?
A(z) is the deterministic part; ξ(z,t) is the stochastic part.
Here z is a state vector specifying the complete internal state of the cell at a given time.
The problem becomes how to find A(z) and ξ(z,t). This can be posed as a variational problem:
Human population behavior and propagation dynamics
Qingyan Hu, Southern University of Science and Technology (SUSTech), Center for Complex Flows and Soft Matter, Department of Statistics and Data Science
information spreading dynamics
Information spreading dynamics contain three key elements:
Information feature αi
Network structure γ
User attribute fi
We model the information spreading dynamics as:
$$\beta_i=\alpha_i\,x\,(1-\gamma)^{f_i(x)}$$
where β_i is the probability that user i spreads the information, α_i is the information feature of user i, x is the fraction of users who have received the information, γ characterizes the network structure, and f_i(x) is the user attribute of user i.
$$f_i(x)=\frac{1}{1+e^{-\lambda_i(x-\theta_i)}}$$
where λi and θi are parameters that depend on the user's attribute, such as age, education, income, etc. The paper says that this function is a logistic function that models the user's cognitive psychology. It means that the user's probability of receiving the information depends on how much the information matches their prior knowledge and beliefs.
Classical theory: social enforcement
$$\beta_i=\alpha_i\,x\,(1-\gamma)^{f_i(x)}+\lambda\sum_{j\in N_i}\beta_j$$
where Ni is the set of neighbors of user i, and λ is the social enforcement parameter.
Has your theory considered the inhomogeneity of different users?
No; α_i is taken the same for all users, but it could be made different.
Here we use
$$f_i(x)=\frac{\ln\left[\beta_i(x)/\beta_i(1)\right]-\ln x}{\ln(1-\gamma)}+1$$
where βi(x) is the probability of user i to spread the information when the fraction of users who have received the information is x.
We simplify the model as:
$$f_i(x)=x^{\omega_i}$$
where ωi is the user attribute of user i.
Disease spreading dynamics
The SIR model is a compartmental model that describes the dynamics of disease spreading in a population. It is a system of ordinary differential equations that models the number of susceptible, infected, and recovered individuals in a population over time.
$$\frac{dS}{dt}=-\beta IS,\qquad \frac{dI}{dt}=\beta IS-\gamma I,\qquad \frac{dR}{dt}=\gamma I$$
where S is the number of susceptible individuals, I is the number of infected individuals, R is the number of recovered individuals, β is the infection rate, and γ is the recovery rate.
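The three equations can be integrated directly; a forward-Euler sketch, with the population normalized to 1 and made-up parameter values:

```python
import numpy as np

def sir_step(S, I, R, beta, gamma, dt):
    # one forward-Euler step of the SIR equations
    dS = -beta * I * S
    dI = beta * I * S - gamma * I
    dR = gamma * I
    return S + dt * dS, I + dt * dI, R + dt * dR

S, I, R = 0.99, 0.01, 0.0          # fractions of the population (illustrative)
beta, gamma, dt = 0.5, 0.1, 0.01   # R0 = beta / gamma = 5
for _ in range(20000):             # integrate to t = 200
    S, I, R = sir_step(S, I, R, beta, gamma, dt)
```

Since dS + dI + dR = 0, the total S + I + R is conserved; with R0 = 5 the epidemic burns through almost the whole population before dying out.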
For COVID, the quarantine time τ is related to R by
$$R(\tau)\approx R_0\,\frac{\tau-\bar T_e}{\bar Q\,\bar T_c}$$
where R_0 is the basic reproduction number, $\bar T_e$ is the average incubation period, $\bar Q$ is the average quarantine time, and $\bar T_c$ is the average recovery time.
The critical quarantine time τ_c is related to R_0 (given infinite average degree) by
$$\tau_c\approx \bar T_e+\frac{\bar Q\,\bar T_c}{R_0}$$
Have you considered the spatial factor in disease spreading?
No, but it can be added.
Spiking Neural Networks
Integrate-and-fire model:
$$C\frac{dV}{dt}=-g_L(V-E_L)+I(t)$$
where C is the membrane capacitance, V is the membrane potential, gL is the leak conductance, EL is the leak reversal potential, and I(t) is the input current.
The membrane potential V is reset to Vr when it reaches the threshold Vth.
V→Vr if V≥Vth
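The integrate-and-fire dynamics with threshold reset can be simulated in a few lines; all parameter values here are illustrative, not from the lecture:

```python
# Minimal leaky integrate-and-fire simulation (illustrative parameters).
C, g_L, E_L = 1.0, 0.1, -65.0      # capacitance, leak conductance, leak reversal
V_th, V_r = -50.0, -65.0           # threshold and reset potentials
dt, T, I_ext = 0.1, 1000.0, 2.0    # time step, duration, constant input current

V = E_L
spikes = []
for step in range(int(T / dt)):
    dV = (-g_L * (V - E_L) + I_ext) / C   # C dV/dt = -g_L (V - E_L) + I(t)
    V += dt * dV
    if V >= V_th:                          # threshold crossing: spike, then reset
        spikes.append(step * dt)
        V = V_r
```

With these values the steady-state voltage without a threshold would be E_L + I/g_L = −45 mV, above V_th, so the neuron fires periodically.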
Spiking neural networks claim low power usage and greater robustness to noise.
The future of computation: neuromorphic computing.
Training SNNs is hard because spiking is not differentiable, so backpropagation cannot be used directly.
Mimicking brain activity is enough for learning; we don't need to know exactly how the brain works.
Training methods:

| Method | Pros | Cons |
| --- | --- | --- |
| ANN-SNN conversion | Easy to implement | Low accuracy |
| Approximate gradient | High accuracy | High complexity |
| Neuro synaptic dynamics | High accuracy | Low performance |
They apply STDP to train the network.
STDP is a biological process that adjusts the strength of connections between neurons in the brain. The process adjusts the connection strengths based on the relative timing of the input signals to pairs of neurons. The STDP process partially explains the activity-dependent development of nervous systems, especially with regard to long-term potentiation and synaptic plasticity.
$$\Delta w_{ij}=\begin{cases}A_+\,e^{-|\Delta t|/\tau_+} & \text{if }\Delta t>0\\ -A_-\,e^{-|\Delta t|/\tau_-} & \text{if }\Delta t<0\end{cases}$$
where Δw_ij is the change in the synaptic weight between the i-th and j-th neurons, A_+ is the amplitude of the positive change, A_− is the amplitude of the negative change, τ_+ is the time constant of the positive change, τ_− is the time constant of the negative change, and Δt is the time difference between the spikes of the i-th and j-th neurons.
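The pair-based rule above is a one-line function per branch; the amplitudes and time constants below are illustrative defaults, not values from the lecture:

```python
import numpy as np

def stdp_dw(dt, A_plus=0.1, A_minus=0.12, tau_plus=20.0, tau_minus=20.0):
    # weight change for a spike pair separated by dt = t_post - t_pre (ms)
    if dt > 0:       # pre fires before post: potentiation
        return A_plus * np.exp(-dt / tau_plus)
    elif dt < 0:     # post fires before pre: depression
        return -A_minus * np.exp(dt / tau_minus)
    return 0.0

dw_pot = stdp_dw(5.0)    # positive change
dw_dep = stdp_dw(-5.0)   # negative change
```

Causal pairs (pre before post) strengthen the synapse and anti-causal pairs weaken it, with both effects decaying exponentially in |Δt|.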
They proposed a new method to train SNN called DA-STDP, which is based on the dopamine reward signal.
where α is the reward modulation for the positive change, and β is the reward modulation for the negative change.
the SNN is implemented in hardware
Probabilistic inference reformulated as tensor networks
Jin-Guo Liu, HKUST
Reformulate probabilistic inference as tensor networks
Probabilistic inference is the task of computing the posterior distribution of the hidden variables given the observed variables.
$$p(h\mid v)=\frac{p(h,v)}{p(v)}$$
where h denotes the hidden variables and v the observed variables.
The joint distribution of the hidden variables and the observed variables is given by:
$$p(h,v)=\frac{1}{Z}\prod_{i=1}^{n}\phi_i(h,v)$$
where Z is the partition function, and ϕi(h,v) is the potential function.
why the joint distribution is like this?
This is the definition of the joint distribution.
Graph probabilistic model: the joint distribution is given by:
$$p(h,v)=\frac{1}{Z}\prod_{i=1}^{n}\phi_i(h_i,v_i)\prod_{j=1}^{m}\phi_j(h_j)$$
where hi is the hidden variables of the i-th node, vi is the observed variables of the i-th node, ϕi(hi,vi) is the potential function of the i-th node, ϕj(hj) is the potential function of the j-th edge, and Z is the partition function.
Where is the graph here?
The graph is the structure of the joint distribution.
Probabilistic Model:
<!-- See TensorInference.jl -->
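The factorized joint distribution above is exactly a tensor network, so the partition function and marginals are contractions. A toy two-factor chain p(a,b,c) ∝ φ₁(a,b)φ₂(b,c), with made-up random factor tables, contracted via einsum:

```python
import numpy as np

rng = np.random.default_rng(1)
phi1 = rng.uniform(0.1, 1.0, (2, 2))   # potential phi_1(a, b)
phi2 = rng.uniform(0.1, 1.0, (2, 2))   # potential phi_2(b, c)

Z = np.einsum("ab,bc->", phi1, phi2)             # partition function
p_ac = np.einsum("ab,bc->ac", phi1, phi2) / Z    # joint of (a, c), b summed out
p_a = p_ac.sum(axis=1)                           # marginal of a
```

Marginalizing a variable is just omitting its index from the einsum output; this is the core of tensor-network inference engines such as the TensorInference.jl package mentioned above.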
Variational Autoregressive Network
Pan Zhang, ITP, 2023-07-28 14:03:12
Variational methods in statistical mechanics
Mean field assumptions:
$$q(s)=\prod_{i=1}^{n}q_i(s_i)$$
where q(s) is the variational distribution and q_i(s_i) is the variational distribution of the i-th spin.
A refined ansatz additionally introduces pairwise variational distributions q_{ij}(s_i,s_j) of the i-th and j-th spins.
Chemical reaction simulation with VAN
See lecture note of Yin Tang.
Machine learning and chaos
Fractional ordinary differential equation (FODE):
$$\frac{d^\alpha x}{dt^\alpha}=f(x,t)$$
where α is the order of the fractional derivative, x is the state variable, t is the time, and f(x,t) is a function of the state variable and the time.
For example, the fractional derivative of order α of the function x(t) is given by:
$$\frac{d^\alpha x}{dt^\alpha}=\frac{1}{\Gamma(1-\alpha)}\frac{d}{dt}\int_0^t \frac{x(t')}{(t-t')^{\alpha}}\,dt'$$
where Γ is the gamma function.
What is the physical meaning of the fractional derivative?
It encodes the memory of the system: the dynamics depend on the whole history of x(t'), weighted by a power law whose exponent is set by α.
How is ML doing better than traditional methods?
Essence of statistical learning: i.i.d. assumption
Producing predictions for an unknown non-linear system is in principle difficult. Here we review some methods to predict chaotic systems based on data or models, trying to grasp the essence of chaos.
Definition and Characterization of Chaos
Question:
Given
$$\dot{x}=f(x,r)$$
where f is non-linear:
$$f(a+b)\neq f(a)+f(b)$$
Predict the final x given the initial x0.
The simplest example is the logistic map
$$x_{n+1}=r\,x_n(1-x_n)$$
For the iterates to stay bounded, r must be restricted to [0,4]; we find that for some r the period of the system diverges, T→∞.
Simple code to replicate it
def logistic(a):
x = [0.3]
for i in range(400):
x.append(a * x[-1] * (1 - x[-1]))
return x[-100:]
for a in np.linspace(2.0, 4.0, 1000):
x = logistic(a)
plt.plot([a]*len(x), x, "c.", markersize=0.1)
plt.xlabel("r")
plt.ylabel("x_f")
plt.show()
Another example is the Lorenz system:
$$\frac{dx}{dt}=\sigma(y-x),\qquad \frac{dy}{dt}=x(\rho-z)-y,\qquad \frac{dz}{dt}=xy-\beta z$$
where x, y, and z are the state variables, σ, ρ, and β are the parameters, and t is the time. The Lorenz system has chaotic solutions for some parameter values and initial conditions. In particular, for σ=10, β=8/3, and ρ=28, the Lorenz system exhibits chaotic solutions for many initial conditions.
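Integrating the Lorenz equations makes the sensitivity to initial conditions visible: two trajectories started 10⁻⁸ apart separate by many orders of magnitude. A self-contained RK4 sketch (step size and horizon are illustrative choices):

```python
import numpy as np

def lorenz_rk4(state, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # one classical Runge-Kutta (RK4) step of the Lorenz equations
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

s1 = np.array([1.0, 1.0, 1.0])
s2 = s1 + np.array([1e-8, 0.0, 0.0])   # nearby initial condition
for _ in range(2000):                  # integrate to t = 20
    s1 = lorenz_rk4(s1, 0.01)
    s2 = lorenz_rk4(s2, 0.01)
gap = np.linalg.norm(s1 - s2)          # vastly larger than the initial 1e-8
```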
Feature of chaos
Sensitivity to initial conditions
This is the property that the system is sensitive to initial conditions.
Mathematical definition:
$$\exists\,\epsilon>0,\ \forall x\in O,\ \forall\delta>0,\ \exists y\in O\ \text{with}\ \lVert x-y\rVert<\delta,\ \exists n\in\mathbb{N}:\ \lVert f^n(x)-f^n(y)\rVert>\epsilon$$
Lyapunov exponent:
$$\lambda=\lim_{t\to\infty}\frac{1}{t}\log\left|\frac{dx(t)}{dx(0)}\right|$$
where λ is the Lyapunov exponent, x(t) is the state variable at time t, and x(0) is the state variable at time 0.
Example: logistic map with r=4 and x0=0.2 has λ=0.69.
If λ>0, the system is chaotic.
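For a one-dimensional map the limit above becomes an average of log|f′| along the orbit; a sketch for the logistic map (the iteration counts are arbitrary choices):

```python
import numpy as np

def logistic_lyapunov(r, x0=0.2, n=100000, burn=1000):
    # lambda = (1/n) sum log |f'(x_k)| with f(x) = r x (1 - x), f'(x) = r (1 - 2x)
    x = x0
    for _ in range(burn):              # discard the transient
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(n):
        x = r * x * (1 - x)
        total += np.log(abs(r * (1 - 2 * x)))
    return total / n

lam = logistic_lyapunov(4.0)           # the note quotes ~0.69 (= log 2) for r = 4
```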
Topological mixing
This is the property that the system will eventually reach any state in the phase space.
Mathematical definition:
$$\forall U,V\subset O,\ \exists N\in\mathbb{N},\ \forall n\ge N:\ f^n(U)\cap V\neq\varnothing$$
where O is the phase space, U and V are two open sets in the phase space, N and n are natural numbers, f^n(U) is the n-th iterate of the set U, and ∅ is the empty set.
Dense periodic orbits
This is the property that the periodic orbits of the system are dense in the phase space.
Mathematical definition:
$$\forall x\in O,\ \forall\epsilon>0,\ \exists y\in O,\ \exists n\in\mathbb{N}:\ f^n(y)=y\ \text{and}\ \lVert x-y\rVert<\epsilon$$
where O is the phase space, x is a point in the phase space, ε is a positive real number, y is a periodic point of period n (i.e. f^n(y)=y), and ∥x−y∥ is the distance between x and y.
Prediction of Chaos
Two questions in chaos study, under the condition that the system dynamics are unknown (model-free):
predict chaos evolution
infer bifurcation diagram
A paper in 2001 used Reservoir Computing to predict chaos evolution, which is a model-free method. Here we reproduce it.
Problem Formulation
Chaos synchronization: given a coupled chaotic oscillator, we want to synchronize the two oscillators.
Oscillator 1:
$$\frac{dx_1}{dt}=f(x_1)+\epsilon(x_2-x_1)$$
Oscillator 2:
$$\frac{dx_2}{dt}=f(x_2)+\epsilon(x_1-x_2)$$
where x1 is the state variable of oscillator 1, x2 is the state variable of oscillator 2, f(x1) is the dynamics of oscillator 1, f(x2) is the dynamics of oscillator 2, and ϵ is the coupling strength. This coupling is called linear coupling, or diffusive coupling, because the coupling term is proportional to the difference between the two oscillators.
How is this related to diffusion?
The coupling term is proportional to the difference between the two oscillators, which is similar to the diffusion term in the diffusion equation: $\frac{\partial u}{\partial t}=D\frac{\partial^2 u}{\partial x^2}$.
complete synchronization
$$(x_1,y_1,z_1)=(x_2,y_2,z_2),\qquad \epsilon>\epsilon_c$$
which means that the two oscillators have the same state variables when the coupling strength is greater than the critical coupling strength.
phase synchronization
$$\theta_1=\theta_2+c,\qquad \epsilon>\epsilon_p$$
where θ1 is the phase of oscillator 1, θ2 is the phase of oscillator 2, c is a constant, and ϵp is the critical coupling strength. This means that the two oscillators have the same phase when the coupling strength is greater than the critical coupling strength.
How is the phase θ defined?
θ is the angle of the oscillator in the phase space. For example, for the logistic map, θ is the angle of the point (xn,xn+1) in the phase space.
Generalized synchronization
$$x_1=G(x_2),\qquad \epsilon>\epsilon_g$$
where x_1 is the state variable of oscillator 1, x_2 is the state variable of oscillator 2, G is a function, and ε_g is the critical coupling strength. This means that the states of the two oscillators are related by a fixed function G when the coupling strength exceeds the critical value.
Science is good, but engineering will make it great.
Why is chaos synchronization important?
Real systems are coupled, and chaos synchronization is a way to model the coupling.
Can we predict the synchronization point ε_c given a few measured trajectories $x_{1,2}(\epsilon_i,t)$, $i\in\{1,2,3\}$?
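A cheap discrete-time stand-in for the coupled oscillators above: two diffusively coupled logistic maps. The coupling values 0.01 and 0.45 are illustrative, chosen on either side of the synchronization transition:

```python
import numpy as np

def sync_error(eps, n=2000, tail=500, seed=0):
    # mean |x1 - x2| over the last `tail` iterations of the coupled maps
    rng = np.random.default_rng(seed)
    x1, x2 = rng.uniform(0.1, 0.9, 2)
    f = lambda x: 4.0 * x * (1.0 - x)          # logistic map, r = 4 (chaotic)
    errs = []
    for i in range(n):
        x1, x2 = (1 - eps) * f(x1) + eps * f(x2), (1 - eps) * f(x2) + eps * f(x1)
        if i >= n - tail:
            errs.append(abs(x1 - x2))
    return float(np.mean(errs))

weak, strong = sync_error(0.01), sync_error(0.45)   # below vs above threshold
```

Below the critical coupling the two orbits stay desynchronized; above it the difference contracts every step and complete synchronization sets in.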
Generate Data
One data pair looks like this:
We can visualize the data by plotting the observations of the first system against the observations of the second system for each epsilon value.
This displays the dynamic coupling between the two systems. This can be seen as a visualization of the phase space of the coupled system.
Another visualization looks at the final state difference (|x_1−x_2|) of the system for each epsilon value.
This indicates a larger bias when ε is large.
Weird, right? We expect the system to synchronize!
To better understand the synchronization, we plot the correlation coefficient between x_1 and x_2.
This shows that the synchronization is not perfect, but the correlation approaches −1 when ε is large.
Anyway, we can use the data to train a model to predict the synchronization point ϵc.
Create a Reservoir
A reservoir can be abstractly thought of as a dynamical system $x_{n+1}=f(x_n)$, where $x_n$ is the state vector at time n and f is the reservoir function. By supervised training, we can fit the reservoir to a given data set.
The community created a package to create reservoirs and apply it to dynamical system prediction.
Here we review one of its notebooks to see how it works.
A (very) short tutorial on Reservoir Computing
Data format must be (time, features)
e.g. (1000, 3) for 3 features and 1000 time steps.
If multiple sequences are provided, the data format must be (sequences, time, features)
e.g. (10, 1000, 3) for 10 sequences of 3 features and 1000 time steps.
Reservoir: neurons randomly connected to their inputs and to themselves; not trainable, randomly initialized under some constraints.
Readout: a decoder with a single layer of neurons, trainable with a linear regression.
No backpropagation needed! See ridge regression.
Feedback: readout neurons can be connected back to the reservoir, to tame the reservoir dynamics (optional).
Tame: make the reservoir dynamics more predictable.
State vector: the reservoir state at time t is the concatenation of the reservoir neurons states at time t.
For every time step, the reservoir state is a vector of length N, where N is the number of neurons in the reservoir.
$x_t=[x_{t,1},x_{t,2},\ldots,x_{t,N}]$, where $x_{t,i}=f(W_{\text{in}}u_t+Wx_{t-1})_i$.
This xt will be stored for later use. It has the shape of (time, neurons). For example, if the reservoir has 30 neurons and the data has 100 time steps, then the state vector has the shape of (100, 30).
ESN: Echo State Network, a type of reservoir computing.
A nice example maps the discretized scalar wave equation, $u_{t+1}=2u_t-u_{t-1}+c^2\Delta t^2\nabla^2 u_t+\Delta t^2 f_t$, onto a recurrent network, where u_{t+1}, u_t, and u_{t−1} are the wave field at successive time steps, I is the identity matrix, Δt is the time step, and ∇² is the Laplace operator.
We can define the hidden state $h_t=\begin{bmatrix}u_t\\u_{t-1}\end{bmatrix}$, the weight matrix $W=\begin{bmatrix}2I+c^2\Delta t^2\nabla^2 & -I\\ I & 0\end{bmatrix}$, and the input $x_t=\begin{bmatrix}f_t\\0\end{bmatrix}$; then
$$h_{t+1}=Wh_t+\Delta t^2 x_t,\qquad y_{t+1}=\left[P^{(0)}h_{t+1}\right]^2$$
where h_{t+1} is the hidden state at time t+1, W is the weight matrix, h_t is the hidden state at time t, x_t is the input at time t, y_{t+1} is the output at time t+1, and P^{(0)} is the projection operator.
How is the training done?
No backpropagation needed. Pseudo-inverse is used to train the network.
The pseudo-inverse is defined as:
$$A^{+}=(A^TA)^{-1}A^T$$
where A⁺ is the pseudo-inverse of A and Aᵀ is the transpose of A.
It is called a pseudo-inverse because it acts like an inverse for non-square A; the formula above only works when AᵀA is invertible (otherwise the Moore-Penrose pseudo-inverse is defined via the SVD).
To train an optical neural network, we need to solve the following equation:
$$Y=XW$$
The solution using the pseudo-inverse is:
$$W=X^{+}Y=(X^TX)^{-1}X^TY$$
Only the linear part can be trained this way, due to the difficulty of replicating non-linear activations.
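The pseudo-inverse solution can be checked on synthetic data; the sizes and the noiseless setup below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 samples, 5 features
W_true = rng.normal(size=(5, 2))
Y = X @ W_true                                 # noiseless targets for illustration

W = np.linalg.pinv(X) @ Y                      # W = X^+ Y (SVD-based, robust)
W_normal = np.linalg.inv(X.T @ X) @ X.T @ Y    # (X^T X)^{-1} X^T Y, needs full rank
```

Both forms recover the same least-squares solution here; `pinv` is preferred in practice because it stays well defined even when XᵀX is singular.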
How the quantum Grover algorithm can be mapped to ML
Grover algorithm is a quantum algorithm that finds a specific element in an unsorted list with high probability.
To find the element x in the list L, we need to find the index i such that L[i]=x.
Steps of Grover algorithm:
Initialize the state $|\psi_0\rangle=|0\rangle^{\otimes \log_2 N}$, where N is the number of elements in the list.
Apply Hadamard gates to obtain the uniform superposition $|\psi_1\rangle=\frac{1}{\sqrt{N}}\sum_{i=0}^{N-1}|i\rangle$.
Apply the oracle operator O to get $|\psi_2\rangle=\frac{1}{\sqrt{N}}\sum_{i=0}^{N-1}(-1)^{f(i)}|i\rangle$, where f(i)=1 if L[i]=x, and f(i)=0 if L[i]≠x.
Apply the diffusion operator $D=2|\psi_1\rangle\langle\psi_1|-I$, which reflects all amplitudes about their mean.
Repeat steps 3 and 4 for k times, with $k\approx\frac{\pi}{4}\sqrt{N}$.
Measure the state to obtain the index i with high probability.
Why does this work? An intuitive explanation:
The oracle operator O flips the sign of the amplitude of |i⟩ if L[i]=x, and does nothing if L[i]≠x. The diffusion operator D then reflects every amplitude about the mean amplitude.
Together, each oracle-plus-diffusion round amplifies the amplitude of the marked state (L[i]=x) and reduces the others, so after about $(\pi/4)\sqrt{N}$ iterations a measurement returns the marked index i with high probability.
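The amplitude-amplification picture can be verified with a plain statevector simulation; the number of qubits and the marked index are arbitrary choices:

```python
import numpy as np

n_qubits, marked = 4, 11
N = 2 ** n_qubits
psi = np.ones(N) / np.sqrt(N)             # uniform superposition |psi_1>

s = np.ones(N) / np.sqrt(N)
D = 2.0 * np.outer(s, s) - np.eye(N)      # diffusion: inversion about the mean
k = int(round(np.pi / 4 * np.sqrt(N)))    # optimal iteration count ~ (pi/4) sqrt(N)
for _ in range(k):
    psi[marked] *= -1.0                   # oracle: flip sign of the marked state
    psi = D @ psi                         # invert about the mean

probs = psi ** 2                          # measurement probabilities
```

For N = 16, three rounds already concentrate over 95% of the probability on the marked index.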
Real intelligence is creation.
The intelligence required to watch a video ≪ the intelligence required to create a video.
Implementation of Grover algorithm using optical neural network:
Using surface wave materials to implement the oracle operator O and the diffusion operator D?
Diffusion System
Dimension reduction is the key to understand the world. It's the essence of understanding.
Two ingredients of learning:
Manifold learning
Diffusion mapping
Manifold learning
<!-- To be Continued -->
Diffusion mapping
Given data $\{x_n\}_{n=1}^N$, $x_n\in\mathbb{R}^p$, we define the affinity between x_i and x_j as:
$$A_{i,j}=\exp\left(-\frac{\lVert x_i-x_j\rVert^2}{\epsilon}\right)$$
where A_{i,j} is the affinity between x_i and x_j, x_i is the i-th data point, x_j is the j-th data point, and ε is the kernel bandwidth parameter.
Gaussian kernel is used here, because it is a smooth function.
We define the diagonal matrix D as:
$$D_{i,i}=\sum_{j=1}^{N}A_{i,j}$$
where D_{i,i} is the i-th diagonal element of D.
We define the probability (transition) matrix P as:
$$P=D^{-1}A$$
where P is row-stochastic, D is the diagonal degree matrix, and A is the affinity matrix.
Then the Laplacian matrix L is defined as:
$$L=I-P$$
where L is the Laplacian matrix, I is the identity matrix, and P is the probability matrix.
The equation of motion can be written as:
$$\frac{\partial p(t)}{\partial t}=-Lp(t)$$
which is the diffusion equation. Where p(t) is the probability distribution at time t, and L is the Laplacian matrix.
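The whole construction fits in a few lines of numpy; the toy data set and the median-based bandwidth are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                           # 50 points in R^3 (toy data)

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
eps = np.median(d2)                                    # hand-picked bandwidth
A = np.exp(-d2 / eps)                                  # Gaussian kernel affinities
D = np.diag(A.sum(axis=1))
P = np.linalg.inv(D) @ A                               # row-stochastic: P = D^{-1} A
L = np.eye(len(X)) - P                                 # Laplacian L = I - P
```

The diffusion-map coordinates would be the leading non-trivial eigenvectors of P; the top eigenvalue is always 1 (the stationary mode).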
What is the difference between row and column normalization?
They represent different physical systems: row normalization gives diffusion, and column normalization gives heat conduction. What makes them different?
Google PageRank is essentially this idea, but it developed a faster algorithm. (Well, is left vs. right normalization different here too?)
Google uses a damping factor α≈0.85 that keeps the chain from getting trapped. The problem we are dealing with here, however, requires n eigenvalues and eigenvectors, and works on a much smaller system.
Anyway, what is left is to compute the stationary distribution of the chain.
Find solution for
$$Px=x$$
Ergodicity? It depends on the connectivity of the graph.
This method claims to reach a new representation of the data, and can define a new (diffusion) distance ρ between the data points.
<!-- Under construction -->
Langevin diffusion
Fokker-Planck equation:
$$\frac{\partial p}{\partial t}=-\nabla\cdot(pv)+\nabla\cdot(D\nabla p)$$
where p is the probability distribution, t is the time, v is the drift velocity, and D is the diffusion tensor.
Langevin equation:
$$\dot{x}=-\nabla U(x)+f+\sqrt{2D}\,\eta$$
where x is the position, U(x) is the potential, f is the external force, D is the diffusion coefficient, and η is Gaussian white noise.
What if the noise is not Gaussian?
The equivalence between the Fokker-Planck equation and the Langevin equation:
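The correspondence can be checked numerically: Euler-Maruyama integration of the Langevin equation for the quadratic potential U(x)=x²/2 (an illustrative choice, with f=0) should reproduce the Gaussian stationary solution of the Fokker-Planck equation, whose variance is D for this potential:

```python
import numpy as np

rng = np.random.default_rng(0)
D, dt, n_steps = 0.5, 0.01, 200000
x, samples = 0.0, []
for i in range(n_steps):
    # Euler-Maruyama step of dx/dt = -U'(x) + sqrt(2 D) eta, with U(x) = x^2 / 2
    x += -x * dt + np.sqrt(2 * D * dt) * rng.normal()
    if i > 10000:                      # discard the transient
        samples.append(x)
var = np.var(samples)                  # should be close to D = 0.5
```

The empirical variance matching D is exactly the stationarity statement p(x) ∝ e^{−U(x)/D} of the Fokker-Planck equation for this system.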
Topological phonon Hall effect is the phenomenon that the phonon Hall conductance is quantized in a topological insulator, so the heat can only be conducted in one direction.
Theory:
$$H=\frac{1}{2}p^Tp+\frac{1}{2}u^T(K-A^2)u+u^TAp$$
where H is the Hamiltonian, p is the momentum, u is the displacement, K is the stiffness matrix, and A is the spin-lattice coupling matrix.
What is the physical meaning of the stiffness matrix?
where J is the phonon Hall conductance, ℏ is the reduced Planck constant, V is the volume, σ labels the phonon branch, k is the wave vector, ω_{σ,k} is the phonon frequency, a_{σ,k} is the phonon annihilation operator, Ω is the Berry connection, and A is the spin-lattice coupling matrix.
The normal-velocity term $\partial\Omega^2/\partial k$ is responsible for the longitudinal phonon transport; the Berry-curvature term $[A,\Omega^2]$ is responsible for the transverse phonon transport.
How does the theory classify different phases?
By the topological invariant, e.g. Chern number.
Problems arise when we deal with amorphous topological phonon structures, because there the topological invariant is not well defined.
Why is the classification of different phases important?
Recommendation system
In the previous diffusion model we defined
$$P=AD^{-1}$$
but for heat conduction we define
$$P=D^{-1}A$$
What is the difference between the two?
A recommendation system recommends items to users based on their preferences. If we denote the preference of user i for item j as r_ij, then we have
$$r_{ij}=\sum_{k=1}^{K}u_{ik}v_{kj}$$
where r_ij is the preference of user i for item j, u_{ik} is the preference of user i for feature k, v_{kj} is the weight of feature k for item j, and K is the number of features.
Consider a book selling situation, here item is book, user is reader, and feature is genre.
The central challenge in recommendation system:
After some tedious derivation, the goal is still to find the eigenvalues and eigenvectors; this is done with finitely many iterations (with truncation).
Machine learning stochastic dynamics
Yin Tang, BNU, 2023-07-31 13:04:59
Dynamical systems can be roughly divided into deterministic and stochastic.
Here we have 4Dt, but what we commonly see is 2Dt; this is because in two dimensions there are two directions, and the diffusion coefficient is the average over the two directions:
$$D=\frac{1}{2}(D_x+D_y)$$
So
$$P(x,t)=\frac{1}{\sqrt{4\pi Dt}}\,e^{-\frac{x^2}{4Dt}}$$
with ⟨x²⟩=2Dt is for 1-D diffusion, while ⟨r²⟩=4Dt is for 2-D diffusion.
For 3-D diffusion it is 6Dt.
If you got 7 solutions to a problem, you finally understand it. -- Feynman
A Random Walk Down Wall Street: The Time-Tested Strategy for Successful Investing. -- Burton G. Malkiel
An experiment gives D = 51.1 μm²/s for E. coli; from
$$\langle x^2\rangle=2Dt$$
we can derive that to travel around x = 1 cm, we need t ≈ 10⁶ s, which is around 11 days.
This is too slow.
If we put food to one side, the bacterium acquires a drift velocity v, and the equation becomes
$$\frac{\partial P}{\partial t}=D\frac{\partial^2 P}{\partial x^2}-v\frac{\partial P}{\partial x}$$
where P is the probability distribution, t is the time, D is the diffusion coefficient, x is the position, and v is the drift velocity.
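The drift-diffusion equation is the continuum limit of a biased random walk; a quick check with made-up parameters (step ±1, bias p = 0.55):

```python
import numpy as np

rng = np.random.default_rng(0)
n_walkers, n_steps, p_right = 5000, 1000, 0.55
steps = np.where(rng.random((n_walkers, n_steps)) < p_right, 1, -1)
x = steps.sum(axis=1)                    # final positions of all walkers

v = 2 * p_right - 1                      # drift per step: 0.1
mean_theory = v * n_steps                # 100
var_theory = n_steps * (1 - v ** 2)      # 990, i.e. variance grows ~ 2 D t
```

The empirical mean grows like vt and the centered variance like 2Dt, which is exactly the content of the PDE above.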
Random walk is everywhere, e.g. complex network, search problem, complex system, and anomalous diffusion.
Every stochastic process can be mapped to a random walk.
Seriously?
Chemical master equation
The chemical master equation is a stochastic model for chemical reactions: it keeps the fluctuations, and it handles many chemical species, going beyond a single "walker".
One species, one reaction: $x\xrightarrow{k}\varnothing$, which means that the species x decays to nothing with rate k.
Continuous: Reaction Rate Equation
dx/dt = −kx + noise,
which is the rate of change of the number of molecules.
Discrete: Chemical Master Equation
∂_t P_t(n) = k(n+1) P_t(n+1) − kn P_t(n),
where P_t(n) is the probability of having n molecules at time t, and k is the reaction rate.
The discrete description is more accurate, but it is hard to solve for large n.
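As a quick check of the master equation (my own sketch), a forward-Euler integration reproduces the exponential decay of the mean, ⟨n⟩(t) = n₀ e^{−kt}:

```python
import numpy as np

k, n0, dt, t_end = 1.0, 20, 1e-3, 1.0
n_max = n0                  # pure decay: n never exceeds its initial value
P = np.zeros(n_max + 1)
P[n0] = 1.0                 # start with exactly n0 molecules
n = np.arange(n_max + 1)

t = 0.0
while t < t_end:
    gain = np.zeros_like(P)
    gain[:-1] = k * n[1:] * P[1:]   # k (n+1) P(n+1)
    loss = k * n * P                # k n P(n)
    P += dt * (gain - loss)
    t += dt

mean_n = (n * P).sum()
print(mean_n, n0 * np.exp(-k * t_end))  # both ≈ 7.36
```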
For a general birth–death process the master equation reads
∂_t P_t(n) = B(n−1) P_t(n−1) + F(n+1) P_t(n+1) − [B(n) + F(n)] P_t(n),
where B(n) is the birth rate, F(n) is the death rate, and k₁, k₂ are the birth and death rate constants.
We use the method of characteristic lines, based on the idea that the solution of the PDE is constant along the characteristic lines, which are the integral curves of the vector field (−1, −k₁).
An intuitive explanation: along a characteristic curve the PDE collapses to an ordinary differential equation, so the solution can be propagated from the boundary data along these curves.
```python
# Import libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the functions a, b, c, d
def a(x, y):
    return 1

def b(x, y):
    return 1

def c(x, y):
    return 0

def d(x, y):
    return 0

# Define the boundary conditions for u
def u_left(y):
    return np.sin(y)

def u_right(y):
    return np.cos(y)

# Define the domain and the grid size
x_min = 0
x_max = 1
y_min = 0
y_max = np.pi
n_x = 11  # number of grid points in x direction
n_y = 11  # number of grid points in y direction
h_x = (x_max - x_min) / (n_x - 1)  # grid spacing in x direction
h_y = (y_max - y_min) / (n_y - 1)  # grid spacing in y direction

# Initialize the grid and the solution matrix
x = np.linspace(x_min, x_max, n_x)
y = np.linspace(y_min, y_max, n_y)
u = np.zeros((n_x, n_y))

# Slope of the characteristic curve
def slope(x, y):
    return b(x, y) / a(x, y)

# ODE along the characteristic curve
def ode(x, y, u):
    return (d(x, y) - c(x, y) * u) / a(x, y)

# Numerical method to solve the ODE (Euler's method)
def euler(x, y, u, h):
    return u + h * ode(x, y, u)

# Loop over the grid points
for i in range(n_x):
    for j in range(n_y):
        if i == 0:
            # Use the left boundary condition
            u[i, j] = u_left(y[j])
        elif i == n_x - 1:
            # Use the right boundary condition
            u[i, j] = u_right(y[j])
        else:
            # Characteristic line method: find the previous point on the curve
            x_prev = x[i - 1]
            y_prev = y[j] - h_x * slope(x_prev, y[j])
            # Interpolate the value of u at the previous point
            u_prev = np.interp(y_prev, y, u[i - 1, :])
            # Solve the ODE along the characteristic curve
            u[i, j] = euler(x_prev, y_prev, u_prev, h_x)

# Plot or output the solution
plt.contourf(x, y, u.T)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Solution of PDE by characteristic line method')
plt.colorbar()
plt.show()
```
This image is a solution of the PDE
∂_t η(y,t) + ∂_y η = 0
with the boundary conditions η(0,t) = sin(t) and η(π,t) = cos(t).
Gillespie algorithm
The Gillespie algorithm is a stochastic simulation algorithm for the chemical master equation: a Monte Carlo method that simulates the time evolution of a chemical system.
A pseudo-code of the algorithm:
```python
# Define the system with the initial number of molecules and the reaction rate constants
# Initialize the time to zero
# Loop until the end time or condition is reached
#     Calculate the total propensity function by summing all the reaction propensities
#     Generate two random numbers from a uniform distribution between 0 and 1
#     Use one random number to determine the time interval until the next reaction event
#     Use another random number to determine which reaction will occur next
#     Update the system: advance the time by the interval and change the molecule numbers
#     according to the chosen reaction
# End loop
# Plot or output the results
```
This algorithm is based on the following theory:
The time interval until the next reaction event is an exponential random variable with rate equal to the total propensity function: τ ∼ Exp(α₀).
Each reaction fires next with probability proportional to its propensity function: Pr(R_i) = α_i/α₀.
Given the chosen reaction, the state update is deterministic: the molecule counts change by that reaction's stoichiometry.
In the approximate tau-leaping variant, the number of times reaction R_i fires during a leap of length τ is Poisson distributed: K_i ∼ Poisson(α_i τ).
```python
# Import libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the system with two reactions: A -> B and B -> A
# The initial numbers of molecules for A and B are 10 and 0, respectively
# The reaction rate constants for A -> B and B -> A are 1.0 and 0.5, respectively
x_A = 10   # number of A molecules
x_B = 0    # number of B molecules
k_1 = 1.0  # rate constant for A -> B
k_2 = 0.5  # rate constant for B -> A

# Initialize the time to zero
t = 0
# Create empty lists to store the time and molecule values
t_list = []
x_A_list = []
x_B_list = []

# Loop until the end time of 10 is reached
while t < 10:
    # Calculate the total propensity function
    a_total = k_1 * x_A + k_2 * x_B
    # Generate two random numbers from a uniform distribution between 0 and 1
    r_1 = np.random.uniform(0, 1)
    r_2 = np.random.uniform(0, 1)
    # Use one random number to determine the time interval until the next reaction event
    tau = (1 / a_total) * np.log(1 / r_1)
    # Use another random number to determine which reaction will occur next
    if r_2 < (k_1 * x_A) / a_total:
        # A -> B occurs
        x_A -= 1
        x_B += 1
    else:
        # B -> A occurs
        x_A += 1
        x_B -= 1
    # Update the time by adding the time interval
    t += tau
    # Append the time and molecule values to the lists
    t_list.append(t)
    x_A_list.append(x_A)
    x_B_list.append(x_B)

# Plot or output the results
plt.plot(t_list, x_A_list, label="A")
plt.plot(t_list, x_B_list, label="B")
plt.xlabel("Time")
plt.ylabel("Number of molecules")
plt.title("Gillespie algorithm for a simple chemical system")
plt.legend()
plt.show()
```
Guarantee of this algorithm:
Theorem: two random variables suffice to simulate the random process. (?)
Take-home message:
The mean-squared displacement is ⟨x²⟩ = 2dDt, where d = 1, 2, 3 is the dimension and D is the diffusion coefficient.
Stochastic reaction networks with M species (each with count N)
Then we can derive the final state of the system from the matrices r, p, c (r is the reactant matrix, p is the product matrix, and c = p − r is the net change).
c=p−r
The chemical transformation matrix T is defined as
A numeric example:
∅ → A with rate 1,  A → ∅ with rate 1.
We have
r = (−1, 1)ᵀ,  p = (1, 1)ᵀ,  c = p − r = (2, 0)ᵀ.
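The relation c = p − r can be checked directly on these vectors (following the note's sign convention):

```python
import numpy as np

r = np.array([-1, 1])   # reactant vector as given in the note
p = np.array([1, 1])    # product vector
c = p - r
print(c)  # [2 0]
```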
The joint probability P_t(n) can be difficult to handle, since the state n ranges over N^M configurations.
We can propose a parameterization of the joint probability P_t^{θ_t}(n), and then minimize the loss defined as
L = D_KL( P_{t+δt}^{θ_{t+δt}}(n) || T P_t^{θ_t}(n) ),
where D_KL is the Kullback–Leibler divergence, P_{t+δt}^{θ_{t+δt}}(n) is the joint probability at time t+δt with parameter θ_{t+δt}, P_t^{θ_t}(n) is the joint probability at time t with parameter θ_t, and T is the chemical transformation matrix.
The dimension of the matrix T is N^M × N^M, where N is the number of molecules per species and M is the number of species.
The loss function is the Kullback-Leibler divergence between the joint probability at time t+δt and the joint probability at time t transformed by the chemical transformation matrix.
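A toy illustration of this loss on a two-state system (the transition matrix and candidate distribution below are made up; a real application would use the N^M-dimensional T and a neural parameterization):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) for discrete probability vectors
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# hypothetical one-step transformation matrix (columns sum to 1)
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])
P_t = np.array([0.7, 0.3])          # current distribution
target = T @ P_t                    # distribution after one time step
candidate = np.array([0.8, 0.2])    # proposed parameterized distribution
loss = kl_divergence(candidate, target)
print(loss)  # small but positive: candidate is close to target
```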
Why not implement the time t as an embedding?
Tried; not promising, due to the negative-probability problem at long steps.
Is this size big enough? Given N and M, the distribution has dimension N^M, where n indicates the number state of the system;
e.g. |1, 0, 0⟩ means there is one molecule of species 1, and no molecules of species 2 and 3.
Learning nonequilibrium statistical mechanics and dynamical phase transitions
Ying Tang, International Academic Center of Complex Systems, Beijing Normal University
Non-equilibrium statistical mechanics is the study of the behavior of systems that are not in thermodynamic equilibrium. Most systems found in nature are not in thermodynamic equilibrium because they are not in stationary states, and are continuously and discontinuously subject to flux of matter and energy to and from other systems and to and from their environment.
d/dt |P_t⟩ = W |P_t⟩,
where |P_t⟩ is the probability distribution at time t, and W is the generator of the Markov process.
|P⟩ = Σ_x P(x) |x⟩,  x = (x₁, x₂, …, x_n),
where P(x) is the probability of the state x, and |x⟩ is the corresponding basis state.
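For a two-state Markov jump process the generator picture looks like this (a toy sketch; the rates a, b are my choices):

```python
import numpy as np

# Generator of a two-state Markov jump process with rates a (0 -> 1) and b (1 -> 0).
# Columns sum to zero, so d/dt |P_t> = W |P_t> conserves total probability.
a, b = 1.0, 2.0
W = np.array([[-a,  b],
              [ a, -b]])

P = np.array([1.0, 0.0])   # start in state 0
dt = 1e-3
for _ in range(10000):     # evolve to t = 10 with forward Euler
    P += dt * (W @ P)

print(P)  # ≈ stationary distribution (b, a)/(a+b) = [2/3, 1/3]
```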
Kinetically constrained models (KCMs) are lattice models of interacting particles that are subject to constraints on their dynamics. They are used to model the dynamics of glassy systems.
Why can KCMs be used to model the dynamics of glassy systems?
Their thermodynamics is trivial, but the kinetic constraint reproduces the slow, heterogeneous relaxation characteristic of glasses.
Examples: FA (Fredrickson–Andersen) model, South-East model.
W_KCM = Σ_i f_i [ c σ_i⁺ + (1−c) σ_i⁻ − c(1−n_i) − (1−c) n_i ],
where f_i is the constraint function of the i-th site, c is the constraint parameter, σ_i⁺ and σ_i⁻ are the creation and annihilation operators on the i-th site, and n_i is the occupation number of the i-th site.
How to determine the way to flip the spin?
This can be done in the following way: flip the spin with probability c if the site is occupied, and flip the spin with probability 1−c if the site is unoccupied.
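One Monte Carlo sweep of a 1D FA-type model under this rule might look as follows (the nearest-neighbor facilitation constraint is my reading of the FA model, so treat the details as assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
L, c = 100, 0.3
n = (rng.random(L) < c).astype(int)   # initial occupations at density c

def sweep(n):
    for _ in range(L):
        i = rng.integers(L)
        # kinetic constraint: site i may only flip if a neighbor is occupied
        if n[(i - 1) % L] + n[(i + 1) % L] == 0:
            continue
        # flip up with probability c, down with probability 1 - c
        if n[i] == 0 and rng.random() < c:
            n[i] = 1
        elif n[i] == 1 and rng.random() < 1 - c:
            n[i] = 0
    return n

for _ in range(200):
    n = sweep(n)
print(n.mean())  # fluctuates around the equilibrium density c
```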
Difficulty in applying this method to 3D systems:
The number of states is too large.
ML can be used to estimate the dynamic partition function:
Z(t) = Σ_x e^{−βE(x)} P(x,t),
where Z(t) is the dynamic partition function, E(x) is the energy of state x, and P(x,t) is the probability of state x at time t.
How to estimate the dynamic partition function?
We may use an autoregressive model to estimate the dynamic partition function.
The dynamical partition function is the moment generating function with the counting field s:
Z_t(s) = Σ_{ω_t} e^{−Σ_{i=1}^{I} s_i E(x_{t_i})} P(ω_t),
where the sum runs over trajectories ω_t.
What is the counting field s here?
This is the only work (by the lecturer) that uses NNs to observe things others cannot.
Track the distribution in the Ornstein–Uhlenbeck process
Stochastic differential equation:
ẋ = −kx + √D ξ(t)
Fokker-Planck equation:
∂P/∂t = ∂/∂x [kx P(x,t)] + (D/2) ∂²P/∂x²,
where P(x,t) is the probability distribution, t is the time, x is the position, k is the drift coefficient, D is the diffusion coefficient, and ξ(t) is Gaussian white noise.
How to solve it?
path integral approach
This is the Langevin equation of the Ornstein-Uhlenbeck process, and can be analytically solved:
where P(xN,tN∣x0,t0) is the probability distribution of the position x at time tN given the position x0 at time t0, k is the drift coefficient, D is the diffusion coefficient, tN is the final time, and t0 is the initial time.
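An Euler–Maruyama simulation of this Langevin equation (my own sketch) reproduces the known stationary variance D/(2k):

```python
import numpy as np

rng = np.random.default_rng(2)
k, D, dt = 1.0, 0.5, 1e-3
n_steps, n_paths = 10000, 2000

x = np.zeros(n_paths)
for _ in range(n_steps):
    # Euler–Maruyama for  dx = -k x dt + sqrt(D) dW
    x += -k * x * dt + np.sqrt(D * dt) * rng.normal(size=n_paths)

print(x.var(), D / (2 * k))  # both ≈ 0.25
```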
How is this connected to stochastic gradient descent (SGD)?
SGD works as follows:
x_{n+1} = x_n − η ∇f(x_n) + √(2ηD) ξ_n
In the continuum limit, we have
ẋ = −∇f(x) + √(2D) ξ(t)
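A toy illustration of the update above on the quadratic f(x) = x²/2 (step size and noise level are my choices); in the stationary state the iterates sample ∝ e^{−f(x)/D}, so Var[x] = D:

```python
import numpy as np

rng = np.random.default_rng(3)
eta, D, n_steps = 0.01, 0.1, 50000

def grad_f(x):
    return x            # gradient of f(x) = x^2 / 2

x, samples = 5.0, []
for step in range(n_steps):
    x = x - eta * grad_f(x) + np.sqrt(2 * eta * D) * rng.normal()
    if step > 10000:    # discard the transient
        samples.append(x)

# stationary density ∝ exp(-f(x)/D), so the variance should approach D
print(np.var(samples), D)
```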
Use flow model to solve Fokker-Planck equation: not introduced in detail. (work in progress)
Machine Learning, Statistical Physics, and Complex Systems
Xiaosong Chen, YNU, 2023-08-01 13:08:31
Apply it to some toy models to see whether this black-box model is explainable.
Long-range connected 2-D network percolation
This model exhibits a phase transition from a non-percolating phase to a percolating phase.
log p(x₀) ≥ E_{q(z₀)}[log p(x₀, z₀) − log q(z₀)] ≡ L(q),
where p(x₀) is the marginal likelihood, p(x₀, z₀) is the joint likelihood, q(z₀) is the variational distribution, and L(q) is the ELBO.
Fluctuation theorem
Fluctuation theorem is a theorem in statistical mechanics that describes the probability distribution of the time-averaged irreversible entropy production of a system that is arbitrarily far from equilibrium.
P(σ) / P(−σ) = e^σ
or
p₀[W₀; λ] / p₀[−W₀; ϵ_Q λ] = e^{βW₀},
where P(σ) is the probability distribution of the time-averaged irreversible entropy production σ, p₀[W₀; λ] is the probability distribution of the work W₀ done on the system, λ is the control parameter, ϵ_Q denotes the time-reversed protocol, and β is the inverse temperature.
Diffusion Model
Forward process: x₀ → … → x_t → x_{t+1} → … → x_T
The forward diffusion process is defined by a sequence of Gaussians with the first-order Markov property.
🔥 In fact, Langevin dynamics is also a Markov process with detailed balance, so it also satisfies the Jarzynski equality. [proved in Jarzynski's 1997 paper]
Snippet from that paper:
x_t = x_{t−1} + (δ/2) ∇_x log q(x_{t−1}) + √δ ε_t, where ε_t ∼ N(0, I) and δ is the step size.
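A sketch of this update sampling a standard Gaussian, whose score ∇_x log q(x) = −x is known in closed form (the target and step size are my choices):

```python
import numpy as np

rng = np.random.default_rng(4)
delta, n_steps, n_chains = 0.01, 5000, 2000

def score(x):
    return -x          # ∇_x log q for the target q = N(0, 1)

x = rng.normal(3.0, 0.1, size=n_chains)   # start far from the target
for _ in range(n_steps):
    # Langevin update: x_t = x_{t-1} + (δ/2) score(x_{t-1}) + sqrt(δ) ε_t
    x = x + 0.5 * delta * score(x) + np.sqrt(delta) * rng.normal(size=n_chains)

print(x.mean(), x.std())  # ≈ 0 and ≈ 1
```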
🔥 Aha! We are in the end learning the noise!
BUT WAIT!
The weighted MSE losses above were found to be unstable in training. DDPM (Ho et al., 2020) instead uses a simplified loss without the weighting term.
But we still need Σ_θ; they use β_t instead.
Nichol & Dhariwal (2021) proposed a new approach to learn Σ_θ (and achieved SOTA performance, of course); check it out if interested.
🧐 Lilian Weng: if β_t is small enough, the reverse conditional will also be Gaussian.
Proved by: Feller, William. "On the theory of stochastic processes, with particular reference to applications." Proceedings of the [First] Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1949. [no copy available]
🔥 Here comes the bold idea!
We may use a neural network to predict x₀ from x_t, namely x₀ = Φ(x_t);
we train the network with the loss ‖x₀ − Φ(x_t)‖².
After that, we can say p(x_{t−1}|x_t) ≈ p(x_{t−1}|x_t, x₀ = Φ(x_t)), or
🍀 Jianlin Su (engineer, ZhuiYi AI) provided an in-depth derivation of DDIM.
Yang Song (PhD, Stanford) introduced the connection of Diffusion Model to Score-based Models.
“Intractability of the likelihood is one of the defining factors of an implicit model, evidenced by the fact that the terms implicit and likelihood-free are often used interchangeably, and the fact that the above paper exists to deal with learning in implicit generative models because the likelihood is intractable.
Your model has a tractable lower bound on the likelihood which you use for training, and only becomes deterministic at test time in the limit of a scalar hyperparameter. I also do not understand how a normalizing flow could be described as an implicit model. ”
But they did accelerate the sampling process.
🔥 Fewer steps
A Chinese quote: "取法乎上,仅得乎中" ("Aim at the highest and you will only attain the middle.") —《帝范》(Di Fan)
How to generate two images interpolation using generative model?
An intuitive way: make it noisy again, and interpolate the noise?
Guided diffusion: ?
How does a GAN interpolate?
Three-color ink in a non-Newtonian liquid
AI4Materials
Element table + spatial group = Crystal structure
Optimization
Find material structures with high free energy and high thermoelectric performance.
Boids algorithm: Birds as agents with the following rule:
Separation: avoid crowding neighbors (short range repulsion)
Alignment: steer towards average heading of neighbors
Cohesion: steer towards average position of neighbors (long range attraction)
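The three rules above can be sketched as one update step (the weights, radius, and time step are arbitrary choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 50
pos = rng.uniform(0, 10, size=(N, 2))   # 2D positions of N boids
vel = rng.normal(0, 1, size=(N, 2))     # 2D velocities

def boids_step(pos, vel, r_sep=1.0, w_sep=0.05, w_ali=0.05, w_coh=0.01, dt=0.1):
    new_vel = vel.copy()
    for i in range(N):
        diff = pos - pos[i]                      # vectors to all other boids
        dist = np.linalg.norm(diff, axis=1)
        close = (dist < r_sep) & (dist > 0)
        # separation: steer away from crowded neighbors (short-range repulsion)
        new_vel[i] -= w_sep * diff[close].sum(axis=0)
        # alignment: steer towards the average heading of the flock
        new_vel[i] += w_ali * (vel.mean(axis=0) - vel[i])
        # cohesion: steer towards the average position (long-range attraction)
        new_vel[i] += w_coh * (pos.mean(axis=0) - pos[i])
    return pos + dt * new_vel, new_vel

for _ in range(100):
    pos, vel = boids_step(pos, vel)
print(pos.shape, vel.shape)
```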
The problem of new material design lies in the ability to grow that material.
Synthesis is another process that can be reformulated as an AI problem.
Automation is the key
A new material that does not absorb solar radiation but only the cosmic background radiation, which lowers its temperature.
Crowdsourcing platforms (众包平台): similar to the ones for protein design and protein folding.
Nature is the best teacher
"Go for the mess".
Closing speech
Find a promising direction.
Iteration:
Iteration is the key to building something great. Instead of starting everything from scratch, we would do better to build on top of existing attempts.
Some quotes from Sam Altman (Founder/CEO of OpenAI):